THE ERGODIC AND COMBINATORIAL APPROACHES TO 

SZEMEREDI'S THEOREM 

TERENCE TAO 

Abstract. A famous theorem of Szemeredi asserts that any set of integers of positive
upper density will contain arbitrarily long arithmetic progressions. In its full generality,
we know of four types of arguments that can prove this theorem: the original
combinatorial (and graph-theoretical) approach of Szemeredi, the ergodic theory approach
of Furstenberg, the Fourier-analytic approach of Gowers, and the hypergraph
approach of Nagle-Rodl-Schacht-Skokan and Gowers. In this lecture series we introduce
the first, second and fourth approaches, though we will not delve into the full
details of any of them. One of the themes of these lectures is the strong similarity
of ideas between these approaches, despite the fact that they initially seem rather
different.

1. Introduction



These lecture notes will be centred upon the following fundamental theorem of Sze- 
meredi: 



Theorem 1.1 (Szemeredi's theorem). ^0] Let $A \subset \mathbb{Z}$ be a subset of the integers of
positive upper density, thus $\limsup_{N \to \infty} \frac{|A \cap [-N, N]|}{2N+1} > 0$. (Here and in the sequel, we
use $|B|$ to denote the cardinality of a finite set $B$.) Then $A$ contains arbitrarily long
arithmetic progressions.

This theorem is rather striking, because it assumes almost nothing on the given set A 
- other than that it is large - and concludes that A is necessarily structured in the sense 
that it contains arithmetic progressions of any given length k. This is a property special 
to arithmetic progressions (and a few other related patterns). Consider for instance the 
question asking whether a set $A$ of positive density must contain a triplet of the form
$\{x, y, x + y\}$. (Compare with the triplet $\{x, y, \frac{x+y}{2}\}$, which is an arithmetic progression
of length three.) It is then clear that the odd numbers, which are certainly a set of
positive upper density, do not contain such triples (see however Theorem 6.1 below).

Or for another example, consider whether a set of positive upper density must contain 
a pair {x,x + 2}. The multiples of 3 provide an immediate counterexample. (This is 
basically why the methods from [25] can leverage Szemeredi's theorem to show that
the primes contain arbitrarily long arithmetic progressions, but are currently unable
to make any progress whatsoever on the twin prime conjecture.) But the arithmetic
progressions seem to be substantially more "indestructible" than these other types of
patterns, in that they seem to occur in any large set A no matter how one tries to 
rearrange A to eliminate all the progressions. 

We have contrasted Szemeredi's theorem with some negative results where the selected 
pattern need not occur. Now let us give the opposite contrast, in which it becomes very 
easy to find a pattern of a certain type in a set. Here is a basic example (a special case
of a result of Hilbert): 

Proposition 1.2. Let $A \subset \mathbb{Z}$ have positive upper density. Then $A$ contains infinitely
many "parallelograms" $\{x, x + a, x + b, x + a + b\}$ where $a, b \neq 0$.

Note that if we could just set $a = b$ in these parallelograms then we could find infinitely
many progressions of length three. Alas, things are not so easy, and while progressions are
certainly intimately related to parallelograms (and more generally to higher-dimensional
parallelopipeds, for which an analogue of Proposition 1.2 can be easily located), the
existence of the latter does not instantly imply the existence of the former without
substantial additional effort. For example, one can easily modify Proposition 1.2 to locate,
for any $k \geq 1$, infinitely many parallelopipeds of the form $\{p + \sum_{i \in A} x_i : A \subset \{1, \ldots, k\}\}$
in the primes $\{2, 3, 5, \ldots\}$, where $p$ is a prime and $x_1, \ldots, x_k > 0$ are positive integers,
but this appears to be of no help whatsoever in locating long arithmetic progressions
in the primes (one would need to somehow force all the $x_i$ to be equal, which does not
seem easily accomplishable).

Proof. Since $A$ has positive upper density, we can find a $\delta > 0$ and arbitrarily large
integers $N$ such that

$|A \cap [-N, N]| \geq \delta N$.

Now consider the collection of all differences $x - y$, where $x, y$ are distinct elements of
$A \cap [-N, N]$. On one hand, there are at least $\delta N(\delta N - 1)$ ordered pairs $(x, y)$ that can generate
such a difference. On the other hand, these differences range from $-2N$ to $2N$, and
thus have at most $4N$ possible values. For $N$ sufficiently large, $\delta N(\delta N - 1) > 4N$,
and hence by the pigeonhole principle we can find distinct pairs $(x, y), (x', y')$ with
$x, y, x', y' \in A \cap [-N, N]$ and $x - y = x' - y' \neq 0$. This generates a parallelogram. A
simple modification of this argument (which we leave to the reader) in fact generates
infinitely many such parallelograms. □
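
The pigeonhole step in this proof can be tried out on small finite sets. The following is only an illustrative sketch: the function name `find_parallelogram` and the choice to scan pairs in sorted order are assumptions made here, not part of the notes. It records each positive difference the first time it is seen and stops as soon as a second pair realises the same difference.

```python
from itertools import combinations

def find_parallelogram(A):
    """Return (x, x + a, x + b, x + a + b) with a, b != 0 drawn from the finite
    set A, or None if no two distinct pairs of A share a difference."""
    seen = {}  # positive difference -> first pair (small, large) realising it
    for y, x in combinations(sorted(A), 2):  # y < x, so x - y > 0
        diff = x - y
        if diff in seen:
            y2, x2 = seen[diff]
            # x2 - y2 == x - y with (y2, x2) != (y, x), hence y != y2.
            # Base point y2, a = diff, b = y - y2.
            return (y2, x2, y, x)
        seen[diff] = (y, x)
    return None

print(find_parallelogram({0, 1, 5, 6}))       # a set with a repeated difference
print(find_parallelogram({1, 2, 4, 8, 16}))   # all differences distinct
```

Note that the sketch allows $a = b$, in which case the parallelogram degenerates to a progression of length three; the proposition does not guarantee that this case ever occurs.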

The above argument in fact yields a very large number of parallelograms; if $|A \cap [-N, N]| \geq \delta N$,
then $A \cap [-N, N]$ in fact contains $\gg \delta^4 N^3$ parallelograms $\{x, x + a, x + b, x + a + b\}$.
This should be compared against the total number of parallelograms in
$[-N, N]$, which is comparable (up to multiplicative constants) to $N^3$. Thus the density
of parallelograms in $A \cap [-N, N]$ differs only by polynomial factors from the density of
$A \cap [-N, N]$ itself. If arithmetic progressions behaved similarly, one would expect a set
$A$ in $[-N, N]$ of density $\delta$ to contain $\gg \delta^{C_k} N^2$ arithmetic progressions of a fixed length
$k$. While this is trivially true for $k = 2$, it fails even for $k = 3$:

Proposition 1.3 (Behrend example). [2] Let $0 < \delta \ll 1$ and $N \geq 1$. Then there exists
a subset $A \subset \{1, \ldots, N\}$ of density $|A|/N \gg \delta$ which contains no more than $\delta^{c \log \frac{1}{\delta}} N^2$
arithmetic progressions $\{n, n + r, n + 2r\}$ of length three, where $c > 0$ is an absolute
constant.

Proof. The basic idea is to exploit the fact that boundaries of strictly convex sets in $\mathbb{R}^d$, such as
spheres, do not contain arithmetic progressions of length three. The main challenge is then to
somehow "embed" $\mathbb{R}^d$ into the interval $\{1, \ldots, N\}$. To do this, let $M, d \geq 1$ be chosen
later, and let $\phi : \{1, \ldots, N\} \to \{0, \ldots, M - 1\}^d$ denote the partial base $M$ map

$\phi(n) := (\lfloor n/M^i \rfloor \bmod M)_{i=0}^{d-1}$

where $\lfloor x \rfloor$ is the greatest integer less than or equal to $x$, and $n \bmod M$ is the remainder of $n$ when
divided by $M$. We then pick an integer $R$ between $1$ and $dM^2$ uniformly at random,
and let $B_R \subset \{0, \ldots, \lfloor M/10 \rfloor\}^d$ be the set

$B_R := \{(x_1, \ldots, x_d) \in \{0, \ldots, \lfloor M/10 \rfloor\}^d : x_1^2 + \ldots + x_d^2 = R\}$

and then let $A_R := \phi^{-1}(B_R) \subset \{1, \ldots, N\}$ be the preimage of $B_R$. The set $B_R$ is
contained in a sphere and thus contains no arithmetic progressions of length three,
other than the trivial ones $\{x, x, x\}$. Because there is no "carrying" when manipulating
base $M$ expansions with digits in $\{0, \ldots, \lfloor M/10 \rfloor\}$, we thus conclude that $A_R$ only
contains an arithmetic progression $(n, n + r, n + 2r)$ when $r$ is a multiple of $M^d$. This
shows that the number of progressions in $A_R$ is at most $O(M^{-d} N^2)$. On the other hand,
whenever $\phi(n) \in \{0, \ldots, \lfloor M/10 \rfloor\}^d$, then $n$ has a probability $1/(dM^2)$ of lying in $A_R$. Thus,
for at least one choice of $R$, we have a lower bound

$|A_R| \gg \frac{1}{dM^2} 10^{-d} N$.

If we set $d := c \log \frac{1}{\delta}$ and $M := \delta^{-c}$ for some small constants $c > 0$ we obtain the
claim. □
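
A small computational sketch of this construction may help. It departs from the proof in several labelled ways: it works with integers $0 \le n < M^d$ rather than $\{1, \ldots, N\}$, it replaces the digit bound $\lfloor M/10 \rfloor$ by a parameter `digit_max` (any bound with $2 \cdot$ `digit_max` $< M$ prevents carrying), and it picks the most populated sphere instead of a random $R$. The function names are hypothetical.

```python
from itertools import product

def behrend_set(M, d, digit_max):
    """Most populated 'sphere' among the integers whose d base-M digits all
    lie in {0, ..., digit_max}. Requires 2 * digit_max < M, so that adding
    two such integers digit by digit never carries."""
    assert 2 * digit_max < M
    spheres = {}  # R -> integers whose digit vector x satisfies |x|^2 = R
    for digits in product(range(digit_max + 1), repeat=d):
        R = sum(x * x for x in digits)
        n = sum(x * M**i for i, x in enumerate(digits))
        spheres.setdefault(R, []).append(n)
    return set(max(spheres.values(), key=len))

def has_3ap(A):
    """True if A contains n, n + r, n + 2r for some r != 0."""
    return any(2 * y - x in A for x in A for y in A if x != y)

A = behrend_set(M=7, d=4, digit_max=3)
print(len(A), has_3ap(A))
```

Because all elements lie below $M^d$, the progressions with $r$ a multiple of $M^d$ that the proof has to allow for cannot occur here, so the set produced is free of non-trivial progressions of length three.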

This example shows that one cannot hope to prove Szemeredi's theorem by an argument
as simple as that used to prove Proposition 1.2, as such simple arguments invariably
give polynomial type bounds. Remarkably, this 60-year old bound of Behrend is
still the best known (apart from the issue of optimising the constant $c$).

Another reason why Szemeredi's theorem is difficult is that it already implies the 
much simpler, but still nontrivial, theorem of van der Waerden: 

Theorem 1.4 (Van der Waerden's theorem). [H)] Suppose that the integers Z are par- 
titioned into finitely many colour classes. Then one of the colour classes contains arbi- 
trarily long arithmetic progressions. 

Indeed, from the pigeonhole principle one of the colour classes would have positive
upper density, which by Szemeredi's theorem gives arbitrarily long progressions. The converse
deduction is far more difficult; while certain proofs of Szemeredi's theorem do indeed 
use van der Waerden's theorem as a component (e.g. |10j, jS], and Section |H1 below), 
many more additional arguments are also needed. 

While van der Waerden's theorem is not terribly difficult to prove (we give a proof 
in the next section), it already yields some non-trivial consequences. Here is one simple 
one: 

Proposition 1.5 (Quadratic recurrence). Let $\alpha$ be a real number and $\varepsilon > 0$. Then one
has $\|\alpha r^2\|_{\mathbb{R}/\mathbb{Z}} < \varepsilon$ for infinitely many integers $r$, where $\|x\|_{\mathbb{R}/\mathbb{Z}}$ denotes the distance from
$x$ to the nearest integer.

Proof. Partition the unit circle $\mathbb{R}/\mathbb{Z}$ into finitely many intervals $I$ of diameter $< \varepsilon/4$.
Each interval $I$ induces a colour class $\{n \in \mathbb{Z} : \alpha n^2/2 \bmod 1 \in I\}$ on the integers $\mathbb{Z}$.
(This is a basic example of a structured colouring; we will see the dichotomy between
structure and randomness repeatedly in the sequel.) By van der Waerden's theorem,
one of these classes contains progressions of length 3 with arbitrarily large spacing $r$,
thus for each such $r$ there is an $n$ for which

$\alpha n^2/2, \ \alpha(n + r)^2/2, \ \alpha(n + 2r)^2/2 \in I$.

The claim now follows from the identity

$\alpha n^2/2 - 2\alpha(n + r)^2/2 + \alpha(n + 2r)^2/2 = \alpha r^2$,

since the left-hand side is the difference of two consecutive differences of points of $I$, each of
which is within $\varepsilon/4$ of an integer, so that $\|\alpha r^2\|_{\mathbb{R}/\mathbb{Z}} < \varepsilon/2 < \varepsilon$.

□
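
The second-difference identity used in this proof, and the conclusion of Proposition 1.5, can be checked numerically. The sketch below is only an illustration with hypothetical helper names; it uses exact rational arithmetic so that the identity holds exactly, and it searches by brute force rather than through van der Waerden's theorem.

```python
from fractions import Fraction

def second_difference(alpha, n, r):
    """alpha*n^2/2 - 2*alpha*(n+r)^2/2 + alpha*(n+2r)^2/2."""
    q = lambda t: alpha * t * t / 2
    return q(n) - 2 * q(n + r) + q(n + 2 * r)

def dist_to_nearest_integer(x):
    """The norm ||x||_{R/Z}: distance from x to the nearest integer."""
    return abs(x - round(x))

def small_quadratic_returns(alpha, eps, r_max):
    """All 1 <= r <= r_max with ||alpha * r^2||_{R/Z} < eps (brute force)."""
    return [r for r in range(1, r_max + 1)
            if dist_to_nearest_integer(alpha * r * r) < eps]

alpha = Fraction(3, 7)
print(second_difference(alpha, 5, 4) == alpha * 4**2)
print(small_quadratic_returns(Fraction(1, 4), Fraction(1, 10), 8))
```

The same functions accept floating-point values of `alpha` (for instance an approximation to an irrational number), in which case the results are subject to rounding error.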

A modification of the argument lets one also handle higher powers $\alpha r^k$. More general
polynomials (with more than one monomial, but with vanishing constant term) can 
also be handled, although the argument is more difficult. This simple example already 
demonstrates however that the number-theoretic question of the distribution of the 
fractional parts of polynomials is already encoded to some extent within Szemeredi's or 
van der Waerden's theorem. 

Szemeredi's theorem has many further important extensions and generalisations which 
we will not discuss here (see for instance Bryna Kra's lectures for some of these). In- 
stead, we will focus on two of the main approaches to proving Szemeredi's theorem in 
its full generality, namely the ergodic theory approach of Furstenberg and the combina- 
torial approach of Rodl and coauthors, as well as Gowers. We will also sketch in very 
vague terms the original combinatorial approach of Szemeredi. We will not, however,
discuss the important Fourier-analytic approach, despite the many connections
between that approach and the ones given here; see Ben Green's lectures for a detailed
treatment of the Fourier-analytic method. The combinatorial and ergodic approaches
may seem rather different at first glance, but we will try to emphasise the many sim- 
ilarities between them. In particular, both approaches are based around a structure 
theorem, which asserts that a general object (such as a subset A of the integers) can be 
somehow split into a "structured" component (which has low complexity, is somehow 
"compact", and has high self-correlation) and a "pseudorandom" component (which 
has high complexity, is somehow "mixing", and has negligible self-correlation). One 
then has to manipulate the structured and pseudorandom components in completely 
different ways to establish the result. 

2. Prelude: van der Waerden's theorem 

Before we plunge into proofs of Szemeredi's theorem, let us first study the much 
simpler model case of van der Waerden's theorem. This theorem has both a simple 
combinatorial proof and a simple dynamical proof; while these proofs do not easily 
scale up to proving Szemeredi's theorem, the comparison between the two is already 
illustrative. 

We begin with the combinatorial proof. There are three key ideas in the argument 
(known as a colour focusing argument). The first is to induct on the length of the 
progression. The second is to establish an intermediate type of pattern between a 
progression of length k and a progression of length k + 1, which one might call a 
"polychromatic fan" . The third is a concatenation of colours trick in order to leverage 
the induction hypothesis on progressions of length k, which allows one to move from 
one fan to the next. 

We need some notation. We use $a + [0, k) \cdot r$ to denote the arithmetic progression
$a, a + r, \ldots, a + (k - 1)r$.

Definition 2.1. Let $c : \{1, \ldots, N\} \to \{1, \ldots, m\}$ be a colouring, let $k \geq 1$, $d \geq 0$, and
$a \in \{1, \ldots, N\}$. We define a fan of radius $k$, degree $d$, and base point $a$ to be a $d$-tuple
$(a + [0, k) \cdot r_1, \ldots, a + [0, k) \cdot r_d)$ of progressions in $\{1, \ldots, N\}$ with $r_1, \ldots, r_d > 0$. We
refer to the progressions $a + [1, k) \cdot r_i$, $1 \leq i \leq d$, as the spokes of the fan. We say that
a fan is polychromatic if its base point and its $d$ spokes are all monochromatic with
distinct colours. In other words, there exist distinct colours $c_0, c_1, \ldots, c_d \in \{1, \ldots, m\}$
such that $c(a) = c_0$, and $c(a + j r_i) = c_i$ for all $1 \leq i \leq d$ and $1 \leq j < k$.
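
To make the definition concrete, here is a minimal sketch of a checker for polychromatic fans. The representation of a colouring as a Python dictionary on $\{1, \ldots, N\}$, the function name, and the restriction to radius $k \geq 2$ (so that every spoke is non-empty) are assumptions of this illustration.

```python
def is_polychromatic_fan(c, N, a, steps, k):
    """Check whether (a + [0,k)*r_1, ..., a + [0,k)*r_d), with steps = (r_1, ..., r_d),
    is a polychromatic fan of radius k for the colouring c of {1, ..., N}."""
    if k < 2 or not (1 <= a <= N) or any(r <= 0 for r in steps):
        return False
    # Every progression must stay inside {1, ..., N}.
    if any(not (1 <= a + (k - 1) * r <= N) for r in steps):
        return False
    colours = [c[a]]  # colour c_0 of the base point
    for r in steps:
        spoke_colours = {c[a + j * r] for j in range(1, k)}  # spoke a + [1,k)*r
        if len(spoke_colours) != 1:
            return False  # this spoke is not monochromatic
        colours.append(spoke_colours.pop())
    return len(set(colours)) == len(colours)  # c_0, c_1, ..., c_d all distinct

c = {1: 1, 2: 2, 3: 2, 4: 3, 5: 3, 6: 1, 7: 3}
print(is_polychromatic_fan(c, 7, 1, (1, 3), 3))
print(is_polychromatic_fan(c, 7, 1, (1, 2), 3))
```

A fan of degree $d = 0$ (empty `steps`) passes the check trivially, matching the trivial base case of the induction on $d$ in the proof of Theorem 2.2.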

Theorem 2.2 (van der Waerden again). Let $k, m \geq 1$. Then there exists $N$ such that
any $m$-colouring of $\{1, \ldots, N\}$ contains a monochromatic progression of length $k$.

It is clear that this implies Theorem 1.4; the converse implication can also be obtained
by a simple compactness argument which we leave as an exercise to the reader. 
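
For very small parameters the least $N$ in Theorem 2.2 can be found by exhaustive search. The sketch below is a brute-force illustration only (the function names and the `limit` safeguard are assumptions); it is exponential in $N$ and says nothing about the bounds produced by the proof.

```python
from itertools import product

def has_mono_ap(colouring, k):
    """colouring[i] is the colour of the integer i + 1; k >= 2.
    True if some progression of length k is monochromatic."""
    N = len(colouring)
    for a in range(N):
        r = 1
        while a + (k - 1) * r < N:
            if len({colouring[a + j * r] for j in range(k)}) == 1:
                return True
            r += 1
    return False

def van_der_waerden_number(k, m, limit=30):
    """Least N <= limit such that every m-colouring of {1, ..., N} contains a
    monochromatic progression of length k, or None if no such N is found."""
    for N in range(1, limit + 1):
        if all(has_mono_ap(col, k) for col in product(range(m), repeat=N)):
            return N
    return None

print(van_der_waerden_number(3, 2))  # 9
```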

Proof. We induct on $k$. The base case $k = 1$ is trivial, so suppose $k \geq 2$ and the claim
has already been proven for $k - 1$.

We now claim inductively that for all $d \geq 0$ there exists a positive integer $N$ such that
any $m$-colouring of $\{1, \ldots, N\}$ contains either a monochromatic progression of length
$k$, or a polychromatic fan of radius $k$ and degree $d$. The base case $d = 0$ is trivial; as
soon as we prove the claim for $d = m$ we are done, as it is impossible in an $m$-colouring
for a polychromatic fan to have degree larger than or equal to $m$.

Assume now that $d \geq 1$ and the claim has already been proven for $d - 1$. We define
$N = 4kN_1N_2$, where $N_1$ and $N_2$ are sufficiently large and will be chosen later.
Let $c : \{1, \ldots, N\} \to \{1, \ldots, m\}$ be an $m$-colouring of $\{1, \ldots, N\}$. Then for any
$b \in \{1, \ldots, N_2\}$, the set $\{bkN_1 + 1, \ldots, bkN_1 + N_1\}$ is a subset of $\{1, \ldots, N\}$ of
cardinality $N_1$. Applying the inductive hypothesis, we see (if $N_1$ is large enough) that
$\{bkN_1 + 1, \ldots, bkN_1 + N_1\}$ contains either a monochromatic progression of length $k$, or
a polychromatic fan of radius $k$ and degree $d - 1$. If there is at least one $b$ in which the
former case applies, we are done, so suppose that the latter case applies for every $b$. This
implies that for every $b \in \{1, \ldots, N_2\}$ there exist $a(b), r_1(b), \ldots, r_{d-1}(b) \in \{1, \ldots, N_1\}$
and distinct colours $c_0(b), \ldots, c_{d-1}(b) \in \{1, \ldots, m\}$ such that $c(bkN_1 + a(b)) = c_0(b)$
and $c(bkN_1 + a(b) + j r_i(b)) = c_i(b)$ for all $1 \leq j \leq k - 1$ and $1 \leq i \leq d - 1$. In particular
the map $b \mapsto (a(b), r_1(b), \ldots, r_{d-1}(b), c_0(b), \ldots, c_{d-1}(b))$ is a colouring of $\{1, \ldots, N_2\}$
by $m^d N_1^d$ colours (which we may enumerate as $\{1, \ldots, m^d N_1^d\}$ in some arbitrary fashion).
Thus (if $N_2$ is large enough) there exists a monochromatic arithmetic progression
$b + [0, k - 1) \cdot s$ of length $k - 1$ in $\{1, \ldots, N_2\}$, with some colour $(a, r_1, \ldots, r_{d-1}, c_0, \ldots, c_{d-1})$.
We may assume without loss of generality that $s$ is negative since we can simply reverse
the progression if $s$ is positive.

Now we use an algebraic trick (similar to Cantor's famous diagonalization trick) which
will convert a progression of identical fans into a new fan of one higher degree, the base
points of the original fans being used to form the additional spoke of the new fan.
Introduce the base point $b_0 := (b - s)kN_1 + a$, which lies in $\{1, \ldots, N\}$ by construction
of $N$, and consider the fan

$(b_0 + [0, k) \cdot skN_1, \ b_0 + [0, k) \cdot (skN_1 + r_1), \ \ldots, \ b_0 + [0, k) \cdot (skN_1 + r_{d-1}))$

of radius $k$, degree $d$, and base point $b_0$. We observe that all the spokes of this fan are
monochromatic. For the first spoke this is because

$c(b_0 + jskN_1) = c((b + (j - 1)s)kN_1 + a) = c_0(b + (j - 1)s) = c_0$

for all $1 \leq j \leq k - 1$ and for the remaining spokes this is because

$c(b_0 + j(skN_1 + r_t)) = c((b + (j - 1)s)kN_1 + a + j r_t) = c_t(b + (j - 1)s) = c_t$

for all $1 \leq j \leq k - 1$, $1 \leq t \leq d - 1$. If the base point $b_0$ has the same colour as one
of the spokes, then we have found a monochromatic progression of length $k$; if the base
point $b_0$ has distinct colour to all of the spokes, we have found a polychromatic fan of
radius $k$ and degree $d$. In either case we have verified the inductive claim, and the proof
is complete. □

Now let us give the dynamical proof. Van der Waerden's theorem follows from the 
following abstract topological statement. Define a topological dynamical system to be 
a pair $(X, T)$ where $X$ is a compact non-empty topological space and $T : X \to X$ is a
homeomorphism¹.

Theorem 2.3 (Topological multiple recurrence theorem). ^HJ Let $(X, T)$ be a topological
dynamical system. Then for any open cover $(V_\alpha)_{\alpha \in A}$ of $X$ and $k \geq 2$, at least one
of the sets in the cover contains a subset of the form $T^{[0,k) \cdot r} x := \{x, T^r x, \ldots, T^{(k-1)r} x\}$
for some $x \in X$ and $r > 0$. (We shall refer to such sets as progressions of length $k$.)

Proof of van der Waerden assuming Theorem 2.3. Let $c : \mathbb{Z} \to \{1, \ldots, m\}$ be an
$m$-colouring of the integers. We can identify $c$ with a point $x_c := (c(n))_{n \in \mathbb{Z}}$ in the
infinite product space $\{1, \ldots, m\}^{\mathbb{Z}}$. Since each $\{1, \ldots, m\}$ is a compact topological space
with the discrete topology, so is $\{1, \ldots, m\}^{\mathbb{Z}}$ with the product topology (by Tychonoff's theorem).
The shift operator $T : \{1, \ldots, m\}^{\mathbb{Z}} \to \{1, \ldots, m\}^{\mathbb{Z}}$
defined by $T((x_n)_{n \in \mathbb{Z}}) := (x_{n-1})_{n \in \mathbb{Z}}$ is a homeomorphism. Let $X$ be the
closure of the orbit $\{T^n x_c : n \in \mathbb{Z}\}$, then $X$ is also compact, and is invariant under
$T$, thus $(X, T)$ is a topological dynamical system. We cover $X$ by the open sets
$V_i := \{(x_n)_{n \in \mathbb{Z}} : x_0 = i\}$ for $i = 1, \ldots, m$; by Theorem 2.3, one of these open sets, say $V_i$,
contains a subset of the form $T^{[0,k) \cdot r} x$ for some $x \in X$ and $r > 0$. Since $X$ is the closure
of the orbit $\{T^n x_c : n \in \mathbb{Z}\}$, we see from the open-ness of $V_i$ and the continuity of $T$
that $V_i$ must in fact contain a set of the form $T^{[0,k) \cdot r} T^n x_c$. But this implies that the
progression $-n - [0, k) \cdot r$ is monochromatic with colour $i$, and the claim follows. □
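
The bookkeeping in the last step of this proof, namely that $T^{n + jr} x_c \in V_i$ for $0 \leq j < k$ translates into the progression $-n - [0, k) \cdot r$ having colour $i$, can be checked on an explicit colouring. In the sketch below a point of $\{1, \ldots, m\}^{\mathbb{Z}}$ is modelled as a Python function on the integers; this representation and the function names are assumptions of the illustration.

```python
def shift_power(x, m):
    """T^m x, where (T x)_n = x_{n-1}; hence (T^m x)_n = x_{n-m}."""
    return lambda n: x(n - m)

def progression_from_recurrence(x_c, n, r, k):
    """If T^{n + j r} x_c lies in one common set V_i = {x : x_0 = i} for all
    0 <= j < k, return the progression -n - [0,k)*r together with i;
    otherwise return None."""
    colours = {shift_power(x_c, n + j * r)(0) for j in range(k)}
    if len(colours) != 1:
        return None
    return [-n - j * r for j in range(k)], colours.pop()

x_c = lambda n: n % 3          # a periodic 3-colouring of the integers
result = progression_from_recurrence(x_c, n=2, r=3, k=4)
print(result)
```

Reading off the zeroth coordinate of $T^{n+jr} x_c$ gives $c(-n - jr)$, which is why the sign of the progression is reversed relative to the exponents of the shift.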

Conversely, it is not difficult to deduce Theorem 2.3 from van der Waerden's theorem,
so the two are totally equivalent. One can view this equivalence as an instance of a
respondence principle between colouring theorems and topological dynamics theorems. 
By invoking this correspondence principle one leaves the realm of number theory and 
enters the infinitary realm of abstract topology. However, a key advantage of doing this 
is that we can now manipulate a new object, namely the compact topological space X. 
Indeed, the proof proceeds by first proving the claim for a particularly simple class of 
such X, the minimal spaces X, and then extending to general X. This strategy can of 
course also be applied directly on the integers, without appeal to the correspondence 
principle, but it becomes somewhat less intuitive when doing so (we invite the reader 
to try it!). 

The space $X$ encodes in some sense all the "finite complexity, translation-invariant"
information that is contained in the colouring $c$. For instance, if $c$ is such that one
never sees a red integer immediately after a blue integer, this fact will be picked up in
$X$ (which will be disjoint from the set $\{(x_n)_{n \in \mathbb{Z}} : x_0 \text{ blue}, x_1 \text{ red}\}$). The correspondence
principle asserts that a colouring theorem can be derived purely by exploiting such
information.

¹ As it turns out, $T$ only needs to be a continuous map rather than a homeomorphism, but we retain
the homeomorphism property for some minor technical simplifications. It is also common to require
$X$ to be a metric space rather than a topological one but this does not make a major difference in the
argument.

Definition 2.4 (Minimal topological dynamical system). A topological dynamical system
$(X, T)$ is said to be minimal if it does not contain any proper subsystem, i.e. there
does not exist $\emptyset \subsetneq Y \subsetneq X$ which is closed with $TY = Y$.

Example 2.5. Consider the torus $X = \mathbb{R}/\mathbb{Z}$ with the doubling map $Tx := 2x$. Then
the torus is not minimal, but it contains the minimal system $\{0\}$, the minimal system
$\{1/3, 2/3\}$, and many other minimal systems. On the other hand, the same torus with
an irrational shift $Tx := x + \alpha$ for $\alpha \notin \mathbb{Q}$ is minimal. Minimality can be viewed as
somewhat analogous to ergodicity in measure-preserving dynamical systems.
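
The finite invariant sets mentioned in this example can be computed exactly for rational starting points. The following sketch (with a hypothetical function name) follows the forward orbit under the doubling map using exact fractions until a point repeats.

```python
from fractions import Fraction

def doubling_orbit(x):
    """Forward orbit of a rational point x of R/Z under T x = 2x mod 1,
    listed until the first repetition."""
    orbit, seen = [], set()
    while x not in seen:
        seen.add(x)
        orbit.append(x)
        x = (2 * x) % 1
    return orbit

print(doubling_orbit(Fraction(0)))      # the fixed point {0}
print(doubling_orbit(Fraction(1, 3)))   # the two-point cycle {1/3, 2/3}
print(doubling_orbit(Fraction(1, 7)))   # a three-point cycle
```

Each purely periodic orbit found this way is a finite closed set permuted by $T$ with no smaller invariant subset, which is what makes it a minimal system in the sense of Definition 2.4.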

Lemma 2.6. Every topological dynamical system contains at least one minimal topo- 
logical dynamical subsystem. 

Proof. Observe that the intersection of any totally ordered chain of topological dynam- 
ical systems is again a topological dynamical system (the non-emptiness of such an 
intersection follows from the finite intersection property of compact spaces). The claim 
now follows from Zorn's lemma. □ 



In light of this lemma, we see that in order to prove Theorem 2.3 it suffices to do so
for minimal systems. One advantage of working with minimal systems is the following. 

Lemma 2.7. Let $(X, T)$ be a minimal dynamical system, and let $V$ be a non-empty
open subset in $X$. Then $X$ can be covered by finitely many shifts $T^n V$ of $V$.

Proof. If the shifts $T^n V$ do not cover $X$, then the complement $X \setminus \bigcup_{n \in \mathbb{Z}} T^n V$ is a non-empty
proper closed invariant subset of $X$, contradicting minimality. Thus the $T^n V$ cover $X$, and the
claim follows from compactness. □

Remark 2.8. There is a notion of a minimal colouring of the integers that corresponds 
to a minimal system; informally speaking, a minimal colouring is one that does not 
"strictly contain" any other colouring, in the sense that the set of finite blocks of the 
latter colouring is a proper subset of the set of finite blocks of the former colouring. 
This lemma then asserts that in a minimal colouring, any block that does appear in that 
colouring, in fact appears syndetically (the gaps between each appearance are bounded). 
Minimal colourings may be considered "maximally structured" , in that all the finite 
blocks that appear in the sequence, appear for a "good reason". The opposite extreme to 
minimal colourings are pseudorandom colourings, in which every finite block of colours 
appears at least once in the sequence (so $X$ is all of $\{1, \ldots, k\}^{\mathbb{Z}}$).



Now we can prove Theorem 2.3 for minimal dynamical systems. We induct on $k$. The
$k = 1$ case is trivial; now suppose that $k \geq 2$ and the claim has already been proven
for $k - 1$, thus given any open cover of $X$, one of the open sets contains a progression
of length $k - 1$. Combining this with Lemma 2.7 (and the trivial observation that the
shift of a progression is again a progression), we obtain

Corollary 2.9. Let $(X, T)$ be a minimal dynamical system, and let $V$ be a non-empty
open subset in $X$. Then $V$ contains a progression of length $k - 1$.

Now we can build fans again. 

Definition 2.10. Let $(X, T)$ be a minimal dynamical system, let $(V_\alpha)_{\alpha \in A}$ be an open
cover of $X$, let $d \geq 0$, and $x \in X$. We define a fan of radius $k$, degree $d$, and base point
$x$ to be a $d$-tuple $(T^{[0,k) \cdot r_1} x, \ldots, T^{[0,k) \cdot r_d} x)$ of progressions of length $k$ with $r_1, \ldots, r_d > 0$,
and refer to the progressions $T^{[1,k) \cdot r_i} x$, $1 \leq i \leq d$, as the spokes of the fan. We say
that a fan is polychromatic if its base point and its $d$ spokes each lie in a distinct element
of the cover. In other words, there exist distinct $\alpha_0, \ldots, \alpha_d \in A$ such that $x \in V_{\alpha_0}$ and
$T^{j r_i} x \in V_{\alpha_i}$ for all $1 \leq i \leq d$ and $1 \leq j < k$.

To prove Theorem 2.3 it now suffices to show

Proposition 2.11. Let $(X, T)$ be a minimal dynamical system, and let $(V_\alpha)_{\alpha \in A}$ be an
open cover of $X$. Then for any $d \geq 0$ either there exists at least one polychromatic
fan of radius $k$ and degree $d$, or at least one of the sets in the open cover contains a
progression of length $k$.

Indeed, by compactness we can make the open cover finite, and the above proposition 
leads to the desired result by taking d large enough. 

Proof. The base case $d = 0$ is trivial. Assume now that $d \geq 1$ and the claim has already
been proven for $d - 1$. If one of the $V_\alpha$ contains a progression of length $k$ we are done,
so we may assume that we have found a polychromatic fan $(T^{[0,k) \cdot r_1} x, \ldots, T^{[0,k) \cdot r_{d-1}} x)$ of
degree $d - 1$, thus there exist distinct $\alpha_0, \ldots, \alpha_{d-1} \in A$ such that $x \in V_{\alpha_0}$ and $T^{j r_i} x \in V_{\alpha_i}$
for all $1 \leq i \leq d - 1$ and $1 \leq j < k$. Since the $T^{j r_i}$ are continuous, we can thus find a
neighbourhood $V$ of $x$ in $V_{\alpha_0}$ such that $T^{j r_i} V \subset V_{\alpha_i}$ for all $1 \leq i \leq d - 1$ and $1 \leq j < k$.
By Corollary 2.9, $V$ contains a progression of length $k - 1$, say $T^{[1,k) \cdot r_0} y$. Thus we see
that $T^{j r_0} y \in V_{\alpha_0}$ for $1 \leq j < k$, and $T^{j(r_0 + r_i)} y \in V_{\alpha_i}$ for $1 \leq j < k$ and $1 \leq i \leq d - 1$.
The point $y$ itself lies in an open set $V_\alpha$. If $\alpha$ equals one of the $\alpha_0, \alpha_1, \ldots, \alpha_{d-1}$, then
$V_\alpha$ contains a progression of length $k$; if $\alpha$ is distinct from $\alpha_0, \alpha_1, \ldots, \alpha_{d-1}$, we have a
polychromatic fan of degree $d$. The claim follows. □

As one can see, the topological dynamics proof contains the same core arithmetical 
ideas as the combinatorial proof (namely, that a progression of fans can be converted to 
either a longer progression, or a fan of one higher degree) but the argument is somewhat 
cleaner as one does not have to keep track of superfluous parameters such as N. For the 
particular purpose of proving van der Waerden's theorem, the additional overhead in 
the dynamical proof makes the total argument longer than the combinatorial proof, but 
for more complicated colouring theorems the dynamical proofs tend to eventually be 
somewhat shorter and conceptually clearer than the combinatorial proofs, which are often
burdened with substantial notation. The dynamical proofs seem to rely quite heavily
on infinitary tools such as Tychonoff's theorem and Zorn's lemma, though one can 
reduce the dependence on these tools by making the argument more "quantitative" (of 
course, if one removes the infinitary framework completely, one ultimately ends up at 
an argument which is more or less just some reworking of the combinatorial argument). 

3. Shelah's argument 

Let us now present another proof of van der Waerden's theorem, due to Shelah [HB]; it 
gives slightly better bounds by avoiding inductive arguments which massively increase 
the number of colours in play. This argument in fact proves a much stronger theorem, 
namely the Hales-Jewett theorem, but we shall content ourselves with a slightly less 
general result in order to avoid a certain amount of notation. 

Definition 3.1 (Cubes). A cube of dimension $d$ and length $k$ is any set of integers of
the form

$a + [0, k)^d \cdot v = \{a + n_1 v_1 + \ldots + n_d v_d : 0 \leq n_1, \ldots, n_d < k\}$

where $a \in \mathbb{Z}$ and $v = (v_1, \ldots, v_d)$ is a $d$-tuple of positive integers, with the property
that all the elements $a + n_1 v_1 + \ldots + n_d v_d$ are distinct.

Cubes are a special case of generalised arithmetic progressions, which play an impor- 
tant role in this subject. 

Theorem 3.2 (Hales-Jewett theorem). [2E] Let $Q$ be a cube of dimension $d$ and length
$k$ which is coloured into $m$ colour classes. If $j \geq 1$, and $d$ is sufficiently large depending
on $k, m, j$, then $Q$ contains a monochromatic subcube $Q'$ of dimension $j$ and length $k$.

Note that the interval $\{1, \ldots, k^d\}$ can be viewed as a proper cube of dimension $d$ and
length $k$. As such, we see that the van der Waerden theorem follows from the $j = 1$
case of the Hales-Jewett theorem. (The original proof of this theorem proceeded by a
colour focusing argument that directly generalised that used to prove van der Waerden's
theorem, and we leave it as an exercise.)
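
The remark that an interval is a proper cube can be verified directly: taking $a = 1$ and $v_i = k^{i-1}$ recovers the base $k$ expansion. The sketch below uses hypothetical helper names and enumerates the cube of Definition 3.1 as a list, so that repeated elements (a failure of properness) are visible.

```python
from itertools import product

def cube(a, v, k):
    """All a + n_1 v_1 + ... + n_d v_d with 0 <= n_i < k, listed with multiplicity."""
    return [a + sum(n * vi for n, vi in zip(ns, v))
            for ns in product(range(k), repeat=len(v))]

def is_proper(a, v, k):
    """True if the k^d elements of the cube are pairwise distinct."""
    points = cube(a, v, k)
    return len(set(points)) == len(points)

k, d = 3, 3
v = tuple(k**i for i in range(d))                        # (1, 3, 9)
print(set(cube(1, v, k)) == set(range(1, k**d + 1)))     # the interval {1, ..., 27}
print(is_proper(1, v, k), is_proper(0, (1, 1), 3))
```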

Shelah's proof of this theorem proceeds by an induction on the length k. The k = 1 
case is trivial, so suppose that k > 1 and that the theorem has already been proven for 
k — 1. Let us call a subcube 

$Q' = \{a + n_1 v_1 + \ldots + n_d v_d : 0 \leq n_1, \ldots, n_d < k\}$ (3.1)

of $Q$ weakly monochromatic if whenever one of the $n_1, \ldots, n_d$ is swapped from $k - 1$ to
$k$ or vice versa, the colour of the element of $Q'$ is unchanged. It will suffice to show

Theorem 3.3 (Hales-Jewett theorem, first inductive step). Let $Q$ be a cube of dimension
$d$ and length $k$ which is coloured into $m$ colour classes. If $j \geq 1$, and $d$ is sufficiently
large depending on $k, m, j$, then $Q$ contains a weakly monochromatic subcube $Q'$ of
dimension $j$ and length $k$.

To prove Theorem 3.2, one may first without loss of generality "stretch" the cube
$Q$ by making each $v_i$ enormously large compared with the previous $v_{i-1}$. This allows
us to eliminate certain "exotic" sub-cubes which would cause some technicalities later
on. Then, we let $J$ be a large integer depending on $k, m, j$ to be chosen later. If
$d$ is large enough depending on $k, m, J$, then by Theorem 3.3 we can find a weakly
monochromatic subcube $Q'$ of $Q$ of dimension $J$ and length $k$. We contract each of
the edges by 1 (deleting all the vertices where one of the $n_i$ is equal to $k$) to create a
subcube $Q''$ of $Q$ of dimension $J$ and length $k - 1$. By the induction hypothesis, we
see that if $J$ is large enough then $Q''$ will in turn contain a monochromatic cube $Q'''$ of
dimension $j$ and length $k - 1$. Since $Q'$ was weakly monochromatic, one can verify that
$Q'''$ extends back to a monochromatic cube $Q''''$ of dimension $j$ and length $k$, which is
contained in $Q$, and the claim follows.

It remains to prove Theorem 3.3. Let us modify the notion of weakly monochromatic
somewhat. Let us call the subcube (3.1) $i$-weakly monochromatic for some $0 \leq i \leq d$ if
whenever one of the $n_1, \ldots, n_i$ is swapped from $k - 1$ to $k$ or vice versa, the colour of
the element of $Q'$ is unchanged. It will suffice to show

Theorem 3.4 (Hales-Jewett theorem, second inductive step). Let $Q$ be a cube of
dimension $d$ and length $k$ which is coloured into $m$ colour classes and which is already $i$-weakly
monochromatic for some $i \geq 0$. If $j \geq i + 1$, and $d$ is sufficiently large depending on
$k, m, j, i$, then $Q$ contains an $(i + 1)$-weakly monochromatic subcube $Q'$ of dimension $j$ and
length $k$.

Indeed, by iterating Theorem 3.4 in $i$ we see that for $d$ large enough depending on
$k, m, j, i$, $Q$ will contain an $i$-weakly monochromatic subcube of dimension $j$ and length
$k$ (the case $i = 0$ is trivial); setting $i = j$ we obtain Theorem 3.3.

It remains to prove Theorem 3.4. As a warmup (and because we need the result to
prove the general case) let us first give a simple special case of this theorem. 

Lemma 3.5 (Hales-Jewett theorem, trivial case). Let $Q$ be a cube of dimension $d$ and
length $k$ which is coloured into $m$ colour classes. If $d \geq m + 1$, then $Q$ contains a
1-weakly monochromatic subcube $Q'$ of dimension 1 and length $k$.

Proof. Write

$Q = \{a + n_1 v_1 + \ldots + n_d v_d : 0 \leq n_1, \ldots, n_d < k\}$

and consider the $m + 1$ elements of $Q$ of the form

$a + (k - 1)v_1 + \ldots + (k - 1)v_s + k v_{s+1} + \ldots + k v_{m+1}$

where $s$ ranges from 1 to $m + 1$. By the pigeonhole principle two of these have the
same colour, thus we have $1 \leq s < s' \leq m + 1$ such that the (1-dimensional, length $k$)
subcube

$\{a + (k - 1)v_1 + \ldots + (k - 1)v_s + n(v_{s+1} + \ldots + v_{s'}) + k v_{s'+1} + \ldots + k v_{m+1} : 1 \leq n \leq k\}$

is 1-weakly monochromatic, and the claim follows. □
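
The pigeonhole step of this lemma is easy to simulate. The sketch below follows the displayed formulas literally, with coefficients $k - 1$ and $k$ and the line parametrised by $1 \leq n \leq k$; the function names, the representation of the colouring as a Python function on integers, and the assumption that `v` has exactly $m + 1$ entries are choices made for this illustration.

```python
def corner(a, v, k, s):
    """a + (k-1)(v_1 + ... + v_s) + k(v_{s+1} + ... + v_{m+1})."""
    return a + (k - 1) * sum(v[:s]) + k * sum(v[s:])

def weakly_mono_line(colour, a, v, k):
    """Find 1 <= s < s2 <= len(v) with corner(s) and corner(s2) of equal colour.
    Guaranteed to succeed when colour takes at most len(v) - 1 values."""
    first_seen = {}  # colour -> first index s whose corner has that colour
    for s in range(1, len(v) + 1):
        col = colour(corner(a, v, k, s))
        if col in first_seen:
            return first_seen[col], s
        first_seen[col] = s
    return None

def line_point(a, v, k, s, s2, n):
    """Point of the line with parameter n; n = k gives corner(s), n = k-1 gives corner(s2)."""
    return a + (k - 1) * sum(v[:s]) + n * sum(v[s:s2]) + k * sum(v[s2:])

colour = lambda x: x % 2             # m = 2 colours, so m + 1 = 3 directions
a, v, k = 0, (1, 10, 100), 3
s, s2 = weakly_mono_line(colour, a, v, k)
print(s, s2, line_point(a, v, k, s, s2, k), line_point(a, v, k, s, s2, k - 1))
```

Swapping the parameter $n$ between $k - 1$ and $k$ moves between the two corners of equal colour, which is exactly the 1-weakly monochromatic property.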

Now we can prove Theorem 3.4 and hence the Hales-Jewett theorem. The main idea
is to recast the cube $Q$, not as an $i$-weakly monochromatic $m$-coloured cube of dimension
$d$ and length $k$, but rather as an $m^{k^{j-1}}$-coloured cube of dimension $d-j+1$ and length
$k$. More precisely, let us write

Q = { a + n 1 v 1 + • • • + n d v d : < m, . . . , n d < k} 
and consider now the modified cube of dimension d — j + 1 and length k 

Q := {a + njVj + . . . + n d v d : < rij, . . . , n d < k}. 

Note that each element $x \in \tilde Q$ of the modified cube is associated to $k^{j-1}$ elements of $Q$, namely

$$\{x + n_1 v_1 + \ldots + n_{j-1} v_{j-1}\}.$$

Each of these elements has one of $m$ colours, and so we can naturally associate an $m^{k^{j-1}}$-colouring
of $\tilde Q$. If $d$ (and hence $d-j+1$) is large enough, we can apply Lemma 3.5
and find a 1-weakly monochromatic subcube $\tilde Q'$ of dimension 1 and length $k$ in $\tilde Q$. It
is easy to verify that this in turn induces an $(i+1)$-weakly monochromatic subcube $Q'$ of
dimension $j$ and length $k$ in $Q$, and we are done.

4. The Furstenberg correspondence principle 

In a previous section, we saw how van der Waerden's theorem was shown to be 
equivalent to a recurrence theorem in topological dynamics. Similarly, Szemeredi's 
theorem is equivalent to a recurrence theorem in measure-preserving dynamics. 
Definition 4.1. A measure-preserving system $(X, \mathcal{B}, \mu, T)$ is a probability space $(X, \mathcal{B}, \mu)$,
where $\mathcal{B}$ is a $\sigma$-algebra of events on $X$ and $\mu : \mathcal{B} \to [0,1]$ is a probability measure (thus
$\mu$ is countably additive with $\mu(X) = 1$), together with a shift map $T : X \to X$ which is a bijection
that is bi-measurable (thus $T^n : \mathcal{B} \to \mathcal{B}$ for all $n \in \mathbb{Z}$) and probability preserving (thus
$\mu(T^n E) = \mu(E)$ for all $E \in \mathcal{B}$ and $n \in \mathbb{Z}$).

Example 4.2 (Circle shift). Take $X$ to be the circle $\mathbb{R}/\mathbb{Z}$ with the Borel $\sigma$-algebra $\mathcal{B}$,
the uniform probability measure $\mu$, and the shift $T : x \mapsto x + \alpha$ where $\alpha \in \mathbb{R}$. Thus
$T^n E = E + n\alpha$ for any $E \in \mathcal{B}$. This system is to recurrence theorems as quasiperiodic
sets, such as the Bohr set $\{n \in \mathbb{Z} : \|n\alpha\|_{\mathbb{R}/\mathbb{Z}} < \theta\}$, are to Szemeredi's theorem - it is an
extreme example of a structured set.

Example 4.3 (Finite systems). Take $X$ to be a finite set, and let $\mathcal{B}$ be the $\sigma$-algebra
generated by some partition $X = A_1 \cup \ldots \cup A_n$ of $X$ into non-empty sets $A_1, \ldots, A_n$
(these sets are known as "atoms"). Thus a set is measurable in $\mathcal{B}$ if and only if it is the
finite union of atoms. We take $\mu$ to be the uniform measure, thus $\mu(E) := |E|/|X|$ for
all $E \in \mathcal{B}$. The shift map $T : X \to X$ is then a permutation on $X$, with the property that
it maps atoms to atoms. Note that if two atoms have different sizes, it will be impossible 
for the shift map (or any power of the shift map) to take one to the other. If one assumes 
that the shift map is ergodic (we will define this later), this forces all the atoms to have 
the same size. The finite case is not the case of interest in recurrence theorems, but it 
does serve as a useful toy model that illustrates many of the basic concepts in the proofs 
without many of the technicalities. Finite systems have a counterpart in Szemeredi's 
theorem as periodic sets - which are trivial for the purpose of demonstrating existence 
of arithmetic progressions, but still serve as an important illustrative special case for 
certain components of the proof of Szemeredi's theorem. 

Remark 4.4. The shift $T$ induces an action $n \mapsto T^n$ of the additive integer group $\mathbb{Z}$ on
$X$. One can also study actions of other groups; for instance, actions of $\mathbb{Z}^2$ are described
by a pair $S, T$ of commuting bi-measurable probability preserving transformations.

Given any measure-preserving system $(X, \mathcal{B}, \mu, T)$, a set $E$, and a point $x \in X$, we
can define the recurrence set $A = A_{x,E} \subset \mathbb{Z}$ of integers by the formula

$$A_{x,E} := \{n \in \mathbb{Z} : T^n x \in E\}. \qquad (4.1)$$

This is a way of identifying sets $E$ in a system with sets $A$ in the integers. Similarly,
given a function $f : X \to \mathbb{R}$ on the system, and an $x \in X$, we can define an associated
sequence $F = F_{x,f} : \mathbb{Z} \to \mathbb{R}$ by the formula

$$F_{x,f}(n) := f(T^n x). \qquad (4.2)$$
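To make (4.1) and (4.2) concrete, here is a small Python sketch for the circle shift of Example 4.2. It is only an illustration under our own assumptions: the function names, the choice $\alpha = \sqrt{2}$, the arc $E = [0, 1/4)$ and the window $[-N, N]$ are not taken from the notes. It lists the part of the recurrence set lying in a finite window and compares its density with the measure of $E$.

```python
import math


def recurrence_set(alpha, x, a, b, N):
    """Integers n in [-N, N] with T^n x = x + n*alpha (mod 1) in the arc E = [a, b)."""
    return [n for n in range(-N, N + 1) if a <= (x + n * alpha) % 1.0 < b]


def orbit_sequence(f, alpha, x, N):
    """The sequence F(n) = f(T^n x) of (4.2), restricted to n in [-N, N]."""
    return [f((x + n * alpha) % 1.0) for n in range(-N, N + 1)]


alpha = math.sqrt(2)      # an irrational rotation number, chosen for the demo
N = 5000
A = recurrence_set(alpha, x=0.0, a=0.0, b=0.25, N=N)
density = len(A) / (2 * N + 1)
print(density)            # expected to be close to mu(E) = 0.25
```

For irrational $\alpha$ the orbit equidistributes, which is why the density of the recurrence set in a large window should approximate $\mu(E)$.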

This correspondence between sets and functions on the system, and sets and functions 
on the integers, underlies the Furstenberg correspondence principle. In particular, it 
allows one to equate Szemeredi's theorem - which is a theorem on the integers - to the 
following theorem on measure-preserving systems. 

Theorem 4.5 (Furstenberg multiple recurrence theorem). ^l] Let $(X, \mathcal{B}, \mu, T)$ be a
measure-preserving system. Then for any set $E \in \mathcal{B}$ of positive measure $\mu(E) > 0$ and
any $k \ge 1$, we have

$$\liminf_{N \to \infty} \mathbb{E}_{1 \le r \le N}\, \mu(E \cap T^r E \cap \ldots \cap T^{(k-1)r} E) > 0$$

where we use the averaging notation $\mathbb{E}_{1 \le r \le N} f(r) := \frac{1}{N} \sum_{r=1}^N f(r)$.




Remark 4.6. The $k = 1$ case is trivial. The $k = 2$ case follows easily from the
pigeonhole principle and is known as the Poincare recurrence theorem. The $k = 3$ case
can be handled by spectral theory (i.e. Fourier analysis). However the general $k$ case is
significantly harder. It is known that the limit on the left actually exists, but this is
harder still to prove (see Bryna Kra's lectures).
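In the finite cyclic model the averages in Theorem 4.5 can be computed exactly. The sketch below is an illustration under our own assumptions (the function name and the test sets are hypothetical): it takes $X = \mathbb{Z}/M\mathbb{Z}$ with the shift $x \mapsto x + 1$ and uniform measure, so that $T^r E = E + r$, and evaluates $\mathbb{E}_{1 \le r \le N}\, \mu(E \cap T^r E \cap \ldots \cap T^{(k-1)r} E)$ with exact rational arithmetic.

```python
from fractions import Fraction


def multiple_recurrence_average(E, M, k, N):
    """Average over 1 <= r <= N of mu(E n T^r E n ... n T^{(k-1)r} E) on Z/MZ,
    where T x = x + 1 and mu is the uniform probability measure."""
    E = frozenset(e % M for e in E)
    total = Fraction(0)
    for r in range(1, N + 1):
        common = set(E)
        for j in range(1, k):
            # T^{jr} E is the translate E + j*r
            common &= {(e + j * r) % M for e in E}
        total += Fraction(len(common), M)
    return total / N


# Even residues mod 12: only even r contribute, each with measure 1/2.
print(multiple_recurrence_average(range(0, 12, 2), 12, 3, 24))
```

In a finite system every $r$ that is a multiple of the period contributes $\mu(E)$, so positivity of the average is immediate; the content of the theorem lies in infinite systems.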

As one consequence of this theorem, we see that every set in $\mathcal{B}$ of positive measure
contains arbitrarily long progressions. This should be contrasted with Theorem 2.3,
which can easily be shown to be a special case of Theorem 4.5.

The Furstenberg correspondence principle asserts an equivalence between results such 
as Szemeredi's theorem in combinatorial number theory, and recurrence theorems in 
ergodic theory. Let us first show how the recurrence theorem implies Szemeredi's theo- 
rem. 

Proof of Szemeredi's theorem assuming Theorem 4.5. This shall be analogous to the
topological correspondence principle, in which we shifted the colouring function $c$ around
and took closures to create the dynamical system $X \subset \{1, \ldots, m\}^{\mathbb{Z}}$. This time we shift a
set $A$ around and take weak limits to create the measure-preserving system $X \subset \{0,1\}^{\mathbb{Z}}$.
One can view this as "inverting" the correspondence (4.1): whereas (4.1) starts with a
set in a system and turns it into a set of integers, here we need to do things the other
way around.

More precisely, suppose for contradiction that Szemeredi's theorem fails. Then there
exists a $k \ge 1$, a set $A \subset \mathbb{Z}$ without progressions of length $k$, and a sequence $N_i$ of
integers going to infinity such that $\liminf_{i \to \infty} \frac{|A \cap [-N_i, N_i]|}{2N_i+1} > 0$. Now for each $i$, consider
the random set

$$A_i := A + x_i$$

where $x_i$ is an integer chosen at random from $[-N_i, N_i]$. As the subsets of $\mathbb{Z}$ can be
identified with elements of $X := \{0,1\}^{\mathbb{Z}}$, we can think of $A_i$ as a random variable taking
values in $X$. More precisely, if we let $\mathcal{B}$ be the Borel $\sigma$-algebra of $X$, we can identify
$A_i$ with a probability measure $\mu_i$ on $X$ (it is the average of $2N_i + 1$ Dirac masses).
Now $X$ is a separable compact Hausdorff space, and so the probability measures are
weakly sequentially compact. This means that (after passing to a subsequence of $i$ if
necessary), the $\mu_i$ converge to another probability measure $\mu$ in the weak sense, thus

$$\lim_{i \to \infty} \int_X f\ d\mu_i = \int_X f\ d\mu$$

for any continuous function $f$ on $X$. In particular, if we let$^2$ $E := \{(x_n)_{n \in \mathbb{Z}} \in \{0,1\}^{\mathbb{Z}} : x_0 = 1\}$,
then since $E$ is both open and closed,

$$\lim_{i \to \infty} \mu_i(E) = \mu(E).$$

But a computation shows

$$\mu_i(E) = \frac{|A \cap [-N_i, N_i]|}{2N_i + 1}$$

and hence $\mu(E) > 0$. Similarly, if $T : X \to X$ is the shift operator $T(x_n)_{n \in \mathbb{Z}} := (x_{n-1})_{n \in \mathbb{Z}}$,
then a brief computation shows that

$$\lim_{i \to \infty} \mu_i(TE) - \mu_i(E) = 0$$

and more generally

$$\lim_{i \to \infty} \mu_i(TF) - \mu_i(F) = 0$$

whenever $F$ is a finite boolean combination of $E$ and its shifts. This means that

$$\mu(TF) = \mu(F)$$

for all such $F$, and then by the Kolmogorov extension theorem we see that $\mu$ is in fact
shift-invariant. Finally, since $A$ contains no arithmetic progressions of length $k$, we see
that

$$\mu_i(E \cap T^r E \cap \ldots \cap T^{(k-1)r} E) = 0$$

for any $r > 0$, and hence on taking limits

$$\mu(E \cap T^r E \cap \ldots \cap T^{(k-1)r} E) = 0.$$

These facts together contradict the Furstenberg recurrence theorem, and we are done.

□

$^2$ This is the correct choice of $E$ if one wants to invert the equivalence (4.1). Indeed, identifying $A$
with a point in $X$, we see that $A = A_{A,E}$, $A_i = A_{A_i,E}$, and so forth.
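The measures $\mu_i$ in this proof are finite averages of Dirac masses, so their values on cylinder sets can be computed directly. The following Python sketch is an illustration under our own assumptions: the function name `mu_i`, the representation of $A$ by a membership predicate, and the two sample sets are hypothetical. It uses the fact that $E \cap T^r E \cap \ldots \cap T^{(k-1)r}E$ is the set of $B \subset \mathbb{Z}$ containing $0, r, \ldots, (k-1)r$.

```python
from fractions import Fraction


def mu_i(in_A, N, positions):
    """mu_i of the cylinder set {B : every p in positions lies in B}, where
    B = A + x is the random translate with x uniform on [-N, N].
    Note that p lies in A + x exactly when p - x lies in A."""
    hits = sum(all(in_A(p - x) for p in positions) for x in range(-N, N + 1))
    return Fraction(hits, 2 * N + 1)


def is_multiple_of_three(n):
    return n % 3 == 0


def is_power_of_two(n):
    return n > 0 and n & (n - 1) == 0


# mu_i(E) is the density of A in [-N, N]; mu_i(TE) differs only by a boundary term.
print(mu_i(is_multiple_of_three, 30, [0]), mu_i(is_multiple_of_three, 30, [1]))

# The powers of two contain no 3-term progression, so these cylinder sets are null.
print(mu_i(is_power_of_two, 50, [0, 5, 10]))
```

The powers of two have density zero, so they are of course not a counterexample to Szemeredi's theorem; they serve here only to show the vanishing of $\mu_i(E \cap T^r E \cap T^{2r} E)$ for a progression-free set.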

One can easily show that the Szemeredi theorem and the Furstenberg recurrence the- 
orem are equivalent to slightly stronger versions of themselves. For instance, Fursten- 
berg's multiple recurrence theorem generalises to 

Theorem 4.7 (Furstenberg multiple recurrence theorem, again). Let $(X, \mathcal{B}, \mu, T)$ be a
measure-preserving system. Then for any bounded measurable function $f : X \to [0,1]$
with $\int_X f\ d\mu > 0$ and any $k \ge 1$, we have

$$\liminf_{N \to \infty} \mathbb{E}_{1 \le r \le N} \int_X f\, T^r f \ldots T^{(k-1)r} f\ d\mu > 0 \qquad (4.3)$$

where $T^r f := f \circ T^{-r}$ is the translation of $f$ by $r$.

This follows simply because if $\int_X f\ d\mu > 0$, then we have the pointwise bound $f \ge c 1_E$
for some $c > 0$ and some set $E$ of positive measure, where $1_E$ is the indicator function of
$E$. In a similar spirit, Szemeredi's theorem has the following quantitative formulation:

Theorem 4.8 (Szemeredi's theorem, again). Let $\mathbb{Z}/N\mathbb{Z}$ be a cyclic group. Then for any
bounded function $f : \mathbb{Z}/N\mathbb{Z} \to [0,1]$ with $\mathbb{E}_{n \in \mathbb{Z}/N\mathbb{Z}} f(n) \ge \delta > 0$ and any $k \ge 1$, we have

$$\mathbb{E}_{n,r \in \mathbb{Z}/N\mathbb{Z}}\, f(n)\, T^r f(n) \ldots T^{(k-1)r} f(n) \ge c(k,\delta)$$

for some $c(k,\delta) > 0$ which is independent of $N$, where $T^r f(n) := f(n-r)$.

It is easy to see that Theorem 4.8 implies Szemeredi's theorem in its original
formulation, and it can also be easily used (by using the correspondence (4.2) between
functions and sequences) to prove Theorem 4.7 or Theorem 4.5 (in fact it gives a lower
bound on (4.3) which depends only on $k$ and the mean $\int_X f\ d\mu$ of $f$). The converse
implication requires an additional averaging argument, essentially due to Varnavides
|4"Tj . We present it here:
Proof of Theorem 4.8 assuming Szemeredi's theorem. First we observe that for any $k \ge 1$
and $\delta > 0$ there exists an $M = M(\delta)$ such that any subset of $[1,M]$ of density
at least $\delta$ contains at least one progression of length $k$. For if this were not the case,
then one could find arbitrarily large $M$ and sets $A_M \subset [1,M]$ with $|A_M| \ge \delta M$ which
contained no progressions of length $k$. Taking unions of translates of such sets (with
$M$ a rapidly increasing sequence) one can easily find a counterexample to Szemeredi's
theorem.

Now we prove Theorem 4.8. It is easy to see that $f(n) \ge \delta/2$ on a set $A \subset \mathbb{Z}/N\mathbb{Z}$ of
density at least $\delta/2$. Thus it will suffice to show that

$$\mathbb{E}_{n,r \in \mathbb{Z}/N\mathbb{Z}}\, 1_{n, n+r, \ldots, n+(k-1)r \in A} \gg_{k,\delta} 1.$$

For $N$ small depending on $k, \delta$ this is clear (just from taking the $r = 0$ case) so assume
$N$ is large. Let $1 \le M < N$ be chosen later. It will suffice to show that

$$\mathbb{E}_{n \in \mathbb{Z}/N\mathbb{Z}}\, \mathbb{E}_{1 \le r \le M}\, 1_{n, n+\lambda r, \ldots, n+(k-1)\lambda r \in A} \gg_{k,\delta} 1$$

for all $\lambda \in \mathbb{Z}/N\mathbb{Z}$, as the claim then follows by averaging in $\lambda$. Shifting $n$ by $\lambda m$ and
averaging in $m$, we rewrite this as

$$\mathbb{E}_{n \in \mathbb{Z}/N\mathbb{Z}}\, \mathbb{E}_{1 \le m, r \le M}\, 1_{n+\lambda m, n+\lambda m+\lambda r, \ldots, n+\lambda m+(k-1)\lambda r \in A} \gg_{k,\delta} 1.$$

On the other hand, we have

$$\mathbb{E}_{n \in \mathbb{Z}/N\mathbb{Z}}\, \mathbb{E}_{1 \le m \le M}\, 1_{n+\lambda m \in A} = |A|/N \ge \delta/2$$

so we have $\mathbb{E}_{1 \le m \le M} 1_{n+\lambda m \in A} \ge \delta/4$ for a set of $n$ of density at least $\delta/4$. For each such
$n$, the set $\{1 \le m \le M : n + \lambda m \in A\}$ has density at least $\delta/4$, and so if we choose
$M = M(\delta/4)$ we have at least one $1 \le m, r \le M$ for which $n + \lambda m, n + \lambda m + \lambda r, \ldots, n +
\lambda m + (k-1)\lambda r \in A$, and so

$$\mathbb{E}_{1 \le m, r \le M}\, 1_{n+\lambda m, n+\lambda m+\lambda r, \ldots, n+\lambda m+(k-1)\lambda r \in A} \gg_M 1.$$

Since $M$ depends only on $k, \delta$, the claim follows. □
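The quantity bounded from below in this proof, $\mathbb{E}_{n,r \in \mathbb{Z}/N\mathbb{Z}} 1_{n, n+r, \ldots, n+(k-1)r \in A}$, can be evaluated by brute force for small $N$. The Python sketch below is a hedged illustration: the function name and the sample sets are our own assumptions, and the quadratic-time loop is only meant for experimentation, not as part of the argument.

```python
from fractions import Fraction


def progression_density(A, N, k):
    """E_{n, r in Z/NZ} of the indicator that n, n+r, ..., n+(k-1)r all lie in A.
    Degenerate progressions (r = 0) are counted, as in the text."""
    A = {a % N for a in A}
    count = sum(
        all((n + j * r) % N in A for j in range(k))
        for n in range(N)
        for r in range(N)
    )
    return Fraction(count, N * N)


# Even residues mod 10: n and r must both be even, giving 25 of 100 pairs.
print(progression_density(range(0, 10, 2), 10, 3))
```

The $r = 0$ terms alone contribute $|A|/N^2$, which is the trivial bound used above when $N$ is small.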

Remark 4.9. One can also deduce Theorem \4-S\ directly from Theorem \4.p] by modifying 
the derivation of Szemeredi 's theorem from Theorem \4-5\ We sketch the ideas briefly 
here. One can replace $f$ by a set $A$ in $\mathbb{Z}/N\mathbb{Z}$. One then randomly translates and dilates
the set $A$ on $\mathbb{Z}/N\mathbb{Z}$ and then lifts up to $\mathbb{Z}$ to create a random set in $\mathbb{Z}$. Now one
argues as before. See jlSJ for a detailed argument. See also [1] for further exploration 
of uniform lower bounds in the Furstenberg recurrence theorem. 

5. Some ergodic theory 

We will not prove Theorem 4.5 or Theorem 4.7 here; see Bryna Kra's lectures for a
detailed treatment of this theory. However we can illustrate some of the key concepts
here. For those readers who are more comfortable with finite mathematical structures,
a good model of a measure-preserving system to keep in mind here is that of the cyclic
shift, where $X = \mathbb{Z}/N\mathbb{Z}$, $\mathcal{B} = 2^X$ is the power set of $X$ (so the atoms are just singleton
sets) and $T : n \mapsto n+1$ is the standard shift. Other finite systems of course exist
(though any such system is ultimately equivalent to the disjoint union of finitely many
such cyclic shifts).

The basic ergodic theory strategy in proving Theorem 4.7 is to first prove this result
for very structured types of functions - functions which have a lot of self-correlation
between their shifts. As it turns out, this is equivalent to studying very structured
factors $\mathcal{B}'$ of the $\sigma$-algebra $\mathcal{B}$. One then extends the recurrence result from simple
factors to more complicated extensions of these factors, continuing in this process (using
Zorn's lemma if necessary) until the full $\sigma$-algebra is recovered (and so all functions are
treated). This is a more complicated version of the topological dynamical situation,
in which there was only one type of structured system, namely a minimal system, and
the extension from minimal systems to arbitrary systems was trivial (after using Zorn's
lemma).

In addition to structured functions, there will also be "anti-structured" or "mixing" 
functions which can be considered orthogonal to the structured functions. These can be 
viewed as functions for which there is absolutely no correlation between certain of their 
shifts. To oversimplify dramatically, one could make the following vague definitions for 
any k > 2: 

• A function $f$ is mixing of order $k-2$ if there is no correlation between the shifts
$f, T^n f, \ldots, T^{(k-1)n} f$ for generic $n$.

• A (possibly vector-valued) function $f$ is strongly structured of order $k-2$ if
knowledge of $f, T^n f, \ldots, T^{(k-2)n} f$ can be used to predict $T^{(k-1)n} f$ perfectly and
"continuously".

• A function $f$ is structured of order $k-2$ if it is a component of a strongly
structured function of order $k-2$, or can be approximated to arbitrary accuracy
by finite linear combinations of such components.

These definitions can be formalised, for instance using the Gowers-Host-Kra
seminorms; see the lectures of Ben Green and Bryna Kra. We will not do so here. However
we shall gradually develop some key examples of these concepts in this section. A
fundamental observation in the subject is that there is a structure theorem that (for any
$k \ge 2$) decomposes any function uniquely into a structured component of order $k-2$
and a mixing component of order $k-2$; indeed, the structured components end up being
precisely those functions which are measurable with respect to a special factor $Y_{k-2}$ of
$\mathcal{B}$, known as the characteristic factor for $k$-term recurrence$^3$. To prove the Furstenberg
recurrence theorem, one first proves recurrence for structured functions of order $d$ for
any $d$ (by induction on $d$), and then shows that weakly mixing functions of order $k-2$ are
negligible for the purpose of establishing $k$-term recurrence. Setting $d = k-2$ and
applying the structure theorem, one obtains the general case.

These matters will be treated in more detail in Bryna Kra's lectures. Here we shall 
give only some extremely simple special cases, to build up some intuition. There will be 
a distinct lack of rigour in this section; for instance, we shall omit certain proofs, and 
be cavalier about whether a function is bounded or merely square integrable, whether 
a limit actually exists, etc. 

We now consider various classes of functions $f : X \to \mathbb{R}$; occasionally we will take
$f$ to be complex-valued or vector-valued instead of real-valued. All functions shall be
bounded.

The most structured type of functions / are the invariant functions, for which Tf = f 
(up to sets of measure zero, of course). These can be viewed as "(strongly) structured 
functions of order 0". It is trivial to verify the Furstenberg recurrence theorem for 



$^3$ We are oversimplifying a lot here; there are some subtleties in precisely how to define this factor.
In particular the factor $Z_{k-2}$ constructed by Host and Kra [27] differs slightly from a similar factor
$Y_{k-2}$ constructed by Ziegler 0B] because a slightly different (but closely related) type of averaging is
considered, using $(k-1)$-dimensional cubes instead of length $k$ progressions. See j.'iOl for a comparison
of the two factors.
such functions. It is also clear that these (bounded) functions $f$ form a von Neumann
algebra$^4$, as the space $L^\infty(X)^T$ of bounded invariant functions is closed under uniform
limits and algebraic operations. Because of this, we can associate a factor $Y_0$ to these
functions, defined as the least $\sigma$-algebra with respect to which all functions in $L^\infty(X)^T$
are measurable; because $L^\infty(X)^T$ is a von Neumann algebra, we see that $L^\infty(X)^T$
consists of precisely those bounded functions which are $Y_0$-measurable. In other words, we take level
sets $f^{-1}([a,b])$ of invariant functions and use these to generate the $\sigma$-algebra. One can
equivalently write $Y_0$ as the space of essentially invariant sets $E$, thus $TE$ is equal to $E$
outside of a set of measure zero. For instance, in the finite case $Y_0$ consists of all sets
that are unions of orbits of $T$; in the cyclic case $X = \mathbb{Z}/N\mathbb{Z}$, $Tx = x + n$, $Y_0$ consists of
all sets that are unions of cosets of the subgroup generated by $n$ (so if $n$ is coprime to $N$, the only
sets in $Y_0$ are the empty set and the whole set). In the case of the circle shift $X = \mathbb{R}/\mathbb{Z}$,
$Tx = x + \alpha$, $Y_0$ is trivial when $\alpha$ is irrational but contains proper subsets of $\mathbb{R}/\mathbb{Z}$ when
$\alpha$ is rational.

Complementary to the invariant functions are the anti-invariant functions, which are
orthogonal to all invariant functions; these are the "mixing functions of order 0". For
instance, given any $g \in L^\infty(X)$, the function $Tg - g$ is an anti-invariant function. In
fact, all anti-invariant functions can be approximated to arbitrary accuracy in $L^2(X)$ by
linear combinations of such basic anti-invariant functions $Tg - g$. This is because if this
were not the case, then by the Hahn-Banach theorem there would exist a non-zero anti-invariant
function $f$ which was orthogonal to all of the $Tg - g$. But then $f$ would be orthogonal
to $Tf - f$, which after some manipulation implies that $Tf - f$ has $L^2$ norm zero and so
$f$ is invariant, contradiction. Because of this fact, we see that the averages of anti-invariant
functions go to zero in the $L^2$ sense:

$$f \perp L^\infty(X)^T \implies \mathbb{E}_{1 \le r \le N} T^r f \to_{L^2(X)} 0 \text{ as } N \to \infty. \qquad (5.1)$$

This can be seen by first testing on basic anti-invariant functions Tg — g (in which case 
one has a telescoping sum), taking linear combinations, and then taking limits. One 
specific consequence of this is the mixing property 

$$\lim_{N \to \infty} \mathbb{E}_{1 \le r \le N} \int_X f\, T^r g\ d\mu = 0 \qquad (5.2)$$

whenever at least one of $f$ and $g$ is anti-invariant. (Note that there is a symmetry
due to the identity $\int_X f\, T^r g = \int_X g\, T^{-r} f$.) We will refer to this as the generalised von
Neumann theorem of order 0.

From Hilbert space theory we know that every function $f$ in $L^2(X)$ uniquely splits
as the sum of an invariant function $f_{U^\perp}$ and an anti-invariant function $f_U$. In fact,
since the invariant functions are not only a closed subspace of $L^2(X)$, but are also the
measurable functions with respect to a factor $Y_0$, we can write explicitly $f_{U^\perp} = \mathbb{E}(f|Y_0)$
and $f_U = f - \mathbb{E}(f|Y_0)$, where the conditional expectation operator $f \mapsto \mathbb{E}(f|Y_0)$ is
simply the orthogonal projection from $L^2(X)$ to the subspace $L^2(Y_0)$ of $Y_0$-measurable
functions.



$^4$ It seems clear that the theory of von Neumann algebras is somehow lurking in the background of
all of this theory, though strangely enough it does not play a prominent role in the current results. An
interesting question is to investigate to what extent this theory would survive if $L^\infty(X)$ was replaced
by a noncommutative von Neumann algebra.
If $f$ is invariant, then clearly its averages converge back to $f$:

$$f \in L^\infty(X)^T \implies \mathbb{E}_{1 \le r \le N} T^r f \to_{L^2(X)} f \text{ as } N \to \infty.$$

Combining this with (5.1) (and taking limits to extend $L^\infty$ to $L^2$) we obtain the von
Neumann ergodic theorem

$$f \in L^2(X) \implies \mathbb{E}_{1 \le r \le N} T^r f \to_{L^2(X)} \mathbb{E}(f|Y_0) \text{ as } N \to \infty.$$

This implies in particular that

$$\mathbb{E}_{1 \le n \le N} \int_X f\, T^n f\ d\mu \to \int_X f\, \mathbb{E}(f|Y_0)\ d\mu = \|\mathbb{E}(f|Y_0)\|_{L^2(X)}^2$$

which already proves the $k = 2$ case of the Furstenberg recurrence theorem (and gives
a precise value for the limit).

Example 5.1. Consider the case of finite systems. Then the invariant functions are
those functions which are constant on each of the orbits of $T$, while the anti-invariant
functions are those functions which have mean zero on each of the orbits of $T$. If
$f : X \to \mathbb{R}$ is a general function, then the invariant part $\mathbb{E}(f|Y_0)$ is the function which
assigns to each orbit of $T$ (i.e. to each atom of $Y_0$) the average value of $f$ on that orbit,
while the anti-invariant part $f - \mathbb{E}(f|Y_0)$ is formed by subtracting the mean of each orbit
from the original function. It is an instructive exercise to verify all the arguments used
to prove the von Neumann ergodic theorem directly in this finite system case.
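The exercise suggested in Example 5.1 can be explored numerically. The following Python sketch is an illustration under our own assumptions (the function names, the sample permutation and the sample function are hypothetical): it represents a finite system by a permutation `perm` with `perm[x]` $= Tx$, computes the invariant part $\mathbb{E}(f|Y_0)$ as the orbit means, and compares it with the ergodic average $\mathbb{E}_{1 \le r \le N} T^r f$, where $T^r f = f \circ T^{-r}$.

```python
from fractions import Fraction


def orbits(perm):
    """Decompose {0, ..., n-1} into the orbits of the permutation x -> perm[x]."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start not in seen:
            orbit, x = [], start
            while x not in seen:
                seen.add(x)
                orbit.append(x)
                x = perm[x]
            result.append(orbit)
    return result


def invariant_part(f, perm):
    """E(f | Y_0): replace f by its mean on each orbit of T."""
    g = [None] * len(f)
    for orbit in orbits(perm):
        mean = Fraction(sum(f[x] for x in orbit), len(orbit))
        for x in orbit:
            g[x] = mean
    return g


def ergodic_average(f, perm, N):
    """E_{1 <= r <= N} T^r f, where (T^r f)(x) = f(T^{-r} x)."""
    n = len(f)
    inverse = [0] * n
    for x, y in enumerate(perm):
        inverse[y] = x
    total = [Fraction(0)] * n
    point = list(range(n))                 # point[x] holds T^{-r} x
    for _ in range(N):
        point = [inverse[p] for p in point]
        for x in range(n):
            total[x] += f[point[x]]
    return [t / N for t in total]


perm = [1, 2, 0, 4, 3]                     # orbits {0, 1, 2} and {3, 4}
f = [3, 0, 0, 1, 5]
print(invariant_part(f, perm))
print(ergodic_average(f, perm, 6))         # N = 6 is a multiple of both orbit lengths
```

When $N$ is a common multiple of the orbit lengths the two outputs agree exactly; for other $N$ the difference is of size $O(1/N)$, in line with the telescoping argument for (5.1).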

The factor $Y_0$ also leads to a useful ergodic decomposition of a general
measure-preserving system into ergodic ones. A measure-preserving system is said to be ergodic
if $Y_0$ is trivial, thus every invariant set has measure either zero or one (or equivalently
every invariant function is constant almost everywhere). One can view the space
$X$ and the $\sigma$-algebra $\mathcal{B}$ as fixed, in which case ergodicity is a property of the
shift-invariant probability measure $\mu$. Then it turns out that while a general measure $\mu$ is
not ergodic, it can always be decomposed (or disintegrated) as an integral $\int_Y \mu_y\ d\nu(y)$
of ergodic shift-invariant probability measures $\mu_y$ parameterised by some parameter $y$
on another probability space $Y$. To formalise this decomposition in general requires a
certain amount of measure theory, but in the case of a finite system the process is quite
simple to describe. Namely, take $Y$ to be the system $(X, Y_0, \mu)$, and for each $y \in Y$
let $\mu_y$ be the uniform distribution on the $T$-orbit $\{T^n y : n \in \mathbb{Z}\}$ of $y$. Then one easily
verifies that $\mu = \int_Y \mu_y\ d\nu(y)$, and that each $\mu_y$ is an ergodic measure (all invariant sets
either have zero measure or full measure). The ergodic decomposition in this case is
essentially just the decomposition of $X$ into individual orbits of $T$, upon each of which
$T$ is ergodic. One can easily use the ergodic decomposition to reduce the task of proving
Furstenberg's recurrence theorem to the special case in which the system is ergodic; we
omit the details. This is somewhat analogous to the reduction in topological dynamics
to minimal systems. Unfortunately, whereas in the dynamical case the assumption of
minimality was very strong and led quickly to a proof of the topological recurrence
theorem, ergodicity is not by itself a strong enough condition to quickly obtain a direct
proof of Furstenberg's recurrence theorem, and further classification and decomposition
of the measure-preserving system is needed. As it turns out, one usually cannot usefully
disintegrate the measure $\mu$ into any smaller invariant measures once one is at an ergodic
system; however it is still possible (and useful) to disintegrate the measures into
non-invariant measures, where the shift map does not act separately on each component,
but instead mixes them together using something called a "cocycle". A simple finitary
example occurs when considering a finite ergodic system $(X, \mathcal{B}, \mu, T)$ with $\mathcal{B} = 2^X$ which
contains a shift-invariant factor $\mathcal{B}' \subset \mathcal{B}$. The ergodicity forces all the atoms in $\mathcal{B}'$ to
be the same size, and thus they are all bijective (non-canonically) to a single set $Z$.
This allows one to parameterise $X$ as $Y \times Z$, where $Y$ is the collection of all the
atoms of $\mathcal{B}'$; since the shift $T$ maps one such atom to another, the factor $(X, \mathcal{B}', \mu, T)$
is then equivalent to a system $(Y, 2^Y, \nu, S)$ on $Y$ where $\nu$ is uniform measure on $Y$,
and the original shift can then be described as $T(y,z) := (Sy, \rho_y(z))$ where for each
$y \in Y$, the cocycle $\rho_y : Z \to Z$ is a permutation on $Z$. One can view $X$ as an extension
of $Y$, by converting each point $y$ to a "vertical fiber" $\{y\} \times Z$. We can disintegrate
$\mu = \int_Y \mu_y\ d\nu(y)$ where $\mu_y$ is uniform measure on $\{y\} \times Z$. These measures are not
invariant; instead $T$ will map $\mu_y$ to $\mu_{Sy}$ for all $y$. The iterates $T^n$ are then described as
$T^n(y,z) = (S^n y, \rho_{y,n}(z))$, where the $\rho_{y,n}$ are defined using the cocycle equation

$$\rho_{y,n+m} = \rho_{S^m y, n} \circ \rho_{y,m}.$$

This is a more complicated version of the more familiar equation $T^{n+m} = T^n \circ T^m$,
thus cocycles are more complicated versions of shifts (indeed as we just saw, a cocycle
is simply the "vertical component" of a shift in a larger product space). The study of
cocycles forms an integral part of the higher order recurrence theory but will not be
discussed here.
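The cocycle equation can be checked mechanically in a small finite example. The Python sketch below is illustrative only: the function names and the particular base shift `S` and fibre permutations `rho` are our own assumptions. It computes $\rho_{y,n}$ by iterating $T(y,z) = (Sy, \rho_y(z))$ and verifies $\rho_{y,n+m} = \rho_{S^m y, n} \circ \rho_{y,m}$ for small $n, m \ge 0$.

```python
def vertical_part(S, rho, y, n):
    """rho_{y,n}: the permutation of Z with T^n(y, z) = (S^n y, rho_{y,n}(z)), n >= 0."""
    perm = list(range(len(rho[y])))        # rho_{y,0} is the identity
    for _ in range(n):
        perm = [rho[y][z] for z in perm]   # apply rho_y at the current base point
        y = S[y]                           # then move the base point forward
    return perm


def shift_base(S, y, m):
    """S^m y for m >= 0."""
    for _ in range(m):
        y = S[y]
    return y


def compose(p, q):
    """The composition p o q, i.e. z -> p[q[z]]."""
    return [p[z] for z in q]


# Hypothetical data: Y = {0, 1, 2} with S a 3-cycle, Z = {0, 1, 2}.
S = [1, 2, 0]
rho = [[1, 2, 0], [1, 0, 2], [0, 2, 1]]

cocycle_holds = all(
    vertical_part(S, rho, y, n + m)
    == compose(vertical_part(S, rho, shift_base(S, y, m), n), vertical_part(S, rho, y, m))
    for y in range(3) for n in range(5) for m in range(5)
)
print(cocycle_holds)
```

The fibre permutations here do not commute, which is the situation in which the order of composition in the cocycle equation matters.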

Now let us look at double recurrence (the $k = 3$ case of Theorem 4.7), in which we
investigate the limiting behavior of averages such as

$$\lim_{N \to \infty} \mathbb{E}_{1 \le r \le N} \int_X f\, T^r f\, T^{2r} f\ d\mu. \qquad (5.3)$$

If $f$ is invariant, then again this expression is easy to compute (it is just $\int_X f^3\ d\mu$). One
may hope, as in the preceding discussion, that anti-invariant functions are negligible, in
the sense that

$$\lim_{N \to \infty} \mathbb{E}_{1 \le r \le N} \int_X f\, T^r g\, T^{2r} h\ d\mu = 0$$

whenever $f, g, h$ are bounded and at least one of $f, g, h$ is anti-invariant. Unfortunately, this
is not the case. For a very simple example, take the small cyclic group $X = \mathbb{Z}/M\mathbb{Z}$ for
odd $M$ and let $f = g = h$ be the function which equals $M-1$ at $0$ and $-1$ elsewhere.
Then these functions are all anti-invariant, but the above average can be computed to
be $M-1$ (the integral equals $(M-1)(M-2)$ when $n$ is a multiple of $M$ and $2$ otherwise);
the problem is that periodically (whenever $n$ is a multiple of $M$) there is a
huge "spike" in the value of $\int_X f\, T^n g\, T^{2n} h\ d\mu$ which imbalances the average dramatically.
Thus periodic functions (ones in which $T^n f = f$ for some $n > 0$) cause a problem. More
generally$^5$, the eigenfunctions, in which $Tf = e^{2\pi i\theta} f$ for some $\theta \in \mathbb{R}/\mathbb{Z}$, will also cause a
problem (note that invariant functions correspond to the case $\theta = 0$). Indeed if one sets



A simple application of Fourier analysis or the spectral theorem reveals that every periodic function 
is a finite linear combination of eigenfunctions, with eigenvalues equal to roots of unity. 
$h := f$ and $g := \overline{f}^2$, then we see that $T^{2r} h = e^{4\pi i r\theta} h$ and $T^r g = e^{-4\pi i r\theta} g$, and hence$^6$

$$\lim_{N \to \infty} \mathbb{E}_{1 \le r \le N} \int_X f\, T^r g\, T^{2r} h\ d\mu = \int_X |f|^4\ d\mu \neq 0,$$

despite the fact that such eigenfunctions will necessarily be anti-invariant for $\theta \neq 0$
(as eigenfunctions of the unitary operator $T$ with distinct eigenvalues are necessarily
orthogonal).
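Both obstructions above can be observed numerically on the cyclic group $\mathbb{Z}/M\mathbb{Z}$ with the shift $x \mapsto x + 1$, so that $T^r g(x) = g(x - r)$. The Python sketch below is a hedged illustration: the function name `triple_average`, the choice $M = 7$, and the averaging over exactly one full period $1 \le r \le M$ are our own assumptions. With the uniform probability measure, the spike function gives the value $M - 1$, and the eigenfunction $f(x) = e^{2\pi i x/M}$ with $g = \overline{f}^2$, $h = f$ gives $\int_X |f|^4\ d\mu = 1$.

```python
import cmath


def triple_average(f, g, h, M):
    """E_{1 <= r <= M} of the integral of f * T^r g * T^{2r} h on Z/MZ with
    uniform probability measure, where (T^r g)(x) = g(x - r)."""
    total = 0
    for r in range(1, M + 1):
        for x in range(M):
            total += f[x] * g[(x - r) % M] * h[(x - 2 * r) % M]
    return total / (M * M)


M = 7
spike = [M - 1] + [-1] * (M - 1)           # equals M - 1 at 0 and -1 elsewhere
print(sum(spike))                          # 0, so the function has mean zero
print(triple_average(spike, spike, spike, M))

eig = [cmath.exp(2j * cmath.pi * x / M) for x in range(M)]   # an eigenfunction of T
g = [z.conjugate() ** 2 for z in eig]
print(abs(triple_average(eig, g, eig, M) - 1) < 1e-9)
```

Since all three sequences are periodic in $r$ with period $M$, the average over one full period coincides with the limit of the averages over $1 \le r \le N$.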

However, one can simply deal with these problems by devising a suitable factor (larger 
than Zq) to contain them. For instance, one can create the factor Yq generated by all 
the periodic functions. This factor can be larger than Zq (e.g. in the finite case, Yq 
is in fact everything). The periodic functions form an algebra (they are closed under 
arithmetic operations) but are not quite a von Neumann algebra because they are not 
quite closed under limits 7 . Nevertheless, the periodic functions are still dense in L 2 (Z ), 
which turns out to be good enough for most purposes. Even larger than Y Q is Z\, the 
factor generated by all eigenfunctions - this factor is known as the Kronecker factor. 
Now the eigenfunctions are not closed under addition (though they are closed under 
multiplication), however the space of quasiperiodic functions - finite linear combinations 
of eigenfunctions - is indeed an algebra. The closure of the quasiperiodic functions in
$L^2$ is the space of almost periodic functions - and this is a von Neumann algebra, indeed an $L^2$
function is almost periodic if it is measurable in $Z_1$. One can classify all these properties
in terms of the orbit $\{T^n f : n \in \mathbb{Z}\}$:

• $f$ is invariant if and only if the orbit $\{T^n f : n \in \mathbb{Z}\}$ is a singleton.

• $f$ is periodic if and only if the orbit $\{T^n f : n \in \mathbb{Z}\}$ is finite.

• $f$ is an eigenfunction if and only if the orbit $\{T^n f : n \in \mathbb{Z}\}$ lives in a
one-dimensional complex vector space.

• $f$ is quasiperiodic if and only if the orbit $\{T^n f : n \in \mathbb{Z}\}$ lives in a
finite-dimensional vector space.

• $f$ is almost periodic if and only if the orbit $\{T^n f : n \in \mathbb{Z}\}$ is precompact (its
closure is compact).

Functions in these classes will be referred to as "structured functions of order 1" or 
"linearly structured functions"; the eigenfunctions 8 are "strongly structured functions 



$^6$ This corresponds to the fact that sets of integers such as the Bohr set $\{n \in \mathbb{Z} : \|\alpha n\|_{\mathbb{R}/\mathbb{Z}} < \varepsilon\}$ have an
unexpectedly high number of progressions of length three, due to the identity $\alpha n - 2\alpha(n+r) + \alpha(n+2r) = 0$,
which implies that if two elements of a progression lie in the Bohr set, then the third element has
an unexpectedly high probability of doing so also. One should caution that this is not always the case;
with the Behrend example in Proposition 1.3, when two elements of a progression lie in the set, then
the third element has an unexpectedly small probability of lying in the set. Thus certain types of
structure can in fact reduce the number of progressions present, though Szemeredi or Furstenberg tells
us that they cannot destroy these progressions completely. This is another indication that the proof
of this theorem has to be somewhat nontrivial (in particular, a naive symmetrisation or variational
argument will not work).

There does not seem to be a conventional name for what the uniform or L 2 limit of periodic 
functions should be called. One possibility is "pro-periodic" or "profinitely periodic" functions. 

An individual quasiperiodic function is usually not strongly structured, in the sense that f(x) does 
not determine T n f(x) in a continuous manner; however a quasiperiodic function is the component of 
a vector- valued function which is strongly structured. For instance, if X = (R/Z) 2 and T(x\,x%) = 
(x\ + ot\,xi -I- a-i) for rationally independent ai,a2, then f{x\,x<i) := e 2lTl \ x ^+ x ^^ is quasiperiodic 
but not strongly structured, however the vector-valued function ( e 2 ™( x i+x2) ^ g 27 ™^ e 2irix 2 ^ j s strongly 
structured. 
of order 1". The term "linear" comes from the fact that the action of $T^n$ on $f$ behaves "linearly"
in $n$; observe for instance that if $f$ is an eigenfunction with eigenvalue $e^{2\pi i\theta}$ then $T^n f =
e^{2\pi i n\theta} f$. Now it turns out that one can get a good handle on the average (5.3) for all
$f$ in the linearly structured classes - and more precisely we have a non-trivial lower
bound when $f$ is non-negative and not identically zero. We already saw what happened
when $f$ was invariant. If instead $f$ was periodic with some period $m$, then we get
a large positive contribution to (5.3) (specifically, $\int_X f^3\ d\mu$) when $n$ is a multiple of
$m$, which is already enough for a non-trivial lower bound. For the other cases, one
can use a pigeonhole argument to show that almost periodic functions behave very
much like periodic functions (hence the name), in the sense that given any $\varepsilon > 0$, we have
$\|T^n f - f\|_{L^2(X)} \le \varepsilon$ for a set of $n$ of positive density. Note that if $T^n f$ is close to $f$,
then (by applying $T^n$ and then the triangle inequality) $T^{2n} f$ is close to $f$ also, which
can be used (together with Holder's inequality and the boundedness of $f$) to show that
$\int_X f\, T^n f\, T^{2n} f\ d\mu$ is close to $\int_X f^3\ d\mu$. This gives a contribution close to $\int_X f^3\ d\mu$ for all $n$ in a set of
positive density, and one still gets a good lower bound for the average (5.3). Note that these arguments
extend easily to higher averages such as those involving $\int_X f\, T^n f \ldots T^{(k-1)n} f\ d\mu$. (But
problems will emerge with the other half of the argument, as orthogonality to linear
structure is not enough to eliminate all problems with triple and higher recurrence.)

There is another proof of recurrence for almost periodic functions which looks more
complicated, but ends up being more robust and can extend (with some effort) to higher
order cases. We know that the orbit $\{T^n f : n \in \mathbf{Z}\}$ is precompact, which means that
for any $\varepsilon > 0$ one can cover this orbit by finitely many balls of radius $\varepsilon$. This allows us to apply the
van der Waerden theorem (or its topological counterpart) and conclude the existence of
many progressions $n, n + r, \ldots, n + (k-1)r$ for which $T^n f, T^{n+r} f, \ldots, T^{n+(k-1)r} f$ are all
close to each other. This means that $\int_X f\, T^r f \cdots T^{(k-1)r} f\,d\mu$ is close to $\int_X f^k\,d\mu > 0$,
which can be used as before to get a nontrivial lower bound.

Now we say that a function $f$ is "mixing of order 1", or "linearly mixing", if it is
orthogonal to all almost periodic functions, or in other words $\mathbf{E}(f|\mathcal{Z}_1) = 0$. It turns out
that a more useful characterisation of this mixing property exists.

Lemma 5.2. A real-valued function $f \in L^\infty(X)$ is mixing of order 1 if and only if the
self-correlation functions $T^n f\, f$ are asymptotically mixing of order 0, in the sense that

$$\lim_{N\to\infty} \mathbf{E}_{-N \le n \le N} \|\mathbf{E}(T^n f\, f\,|\,\mathcal{Z}_0)\|_{L^2}^2 = 0. \qquad (5.4)$$

Proof. (Sketch only) Suppose first that $f$ obeys the property (5.4). A Cauchy-Schwarz
argument (based on something called the van der Corput lemma), which we omit, then
shows that

$$\lim_{N\to\infty} \mathbf{E}_{-N \le n \le N} \|\mathbf{E}(T^n g\, f\,|\,\mathcal{Z}_0)\|_{L^2}^2 = 0$$

for any bounded $g$. If we apply this in the particular case that $g$ is an eigenfunction, we
have $\|\mathbf{E}(T^n g\, f|\mathcal{Z}_0)\|_{L^2} = \|\mathbf{E}(g f|\mathcal{Z}_0)\|_{L^2}$ and hence $\mathbf{E}(g f|\mathcal{Z}_0) = 0$ for all eigenfunctions $g$.
In particular $f$ is orthogonal to all eigenfunctions, hence to all quasiperiodic functions,
hence to all almost periodic functions, and is thus mixing of order 1.

Now suppose that (5.4) fails. We rewrite the left-hand side (ignoring issues regarding
interchange of limit and integral, which can be justified using the von Neumann ergodic
theorem applied to the product space $X \times X$) as

$$\Big\langle f,\ \lim_{N\to\infty} \mathbf{E}_{-N \le n \le N}\, \mathbf{E}(T^n f\, f\,|\,\mathcal{Z}_0)\, T^n f \Big\rangle.$$
Let us introduce the linear operator $S : L^2(X) \to L^2(X)$ by

$$Sg := \lim_{N\to\infty} \mathbf{E}_{-N \le n \le N}\, \mathbf{E}(T^n f\, g\,|\,\mathcal{Z}_0)\, T^n f$$

(again, let us ignore the issue regarding whether this limit exists). Thus $\langle f, Sf\rangle \neq 0$.
This is a self-adjoint operator (in fact, it is positive definite). Also, being the limit of
averages of finite rank operators, it can be shown to be a compact operator. Finally,
we have the translation invariance property $T^n S = S T^n$. In particular, this shows that
the orbit of $Sf$ lies in the image under $S$ of a bounded set and is thus precompact:

$$\{T^n S f : n \in \mathbf{Z}\} = \{S T^n f : n \in \mathbf{Z}\} \subset \{Sg : \|g\|_{L^2(X)} \le \|f\|_{L^2(X)}\}.$$

This shows that $Sf$ is almost periodic. Since $\langle f, Sf\rangle \neq 0$, $f$ is not orthogonal to all almost periodic
functions, a contradiction. □

By using (5.4) and some Cauchy-Schwarz (more precisely, using the van der Corput
lemma) one can show that weakly mixing functions of order 1 are negligible for the
purposes of double recurrence; indeed, we have

$$\lim_{N\to\infty} \mathbf{E}_{1 \le r \le N} \int_X f\, T^r g\, T^{2r} h\,d\mu = 0$$

whenever $f, g, h$ are bounded and at least one of $f, g, h$ is mixing of order 1. We can
refer to this as the generalised von Neumann theorem of order 1. On the other hand,
every bounded function $f$ has a unique decomposition $f = \mathbf{E}(f|\mathcal{Z}_1) + (f - \mathbf{E}(f|\mathcal{Z}_1))$ as
an almost periodic function $\mathbf{E}(f|\mathcal{Z}_1)$ and a weakly mixing function $f - \mathbf{E}(f|\mathcal{Z}_1)$; I like
to refer to this as the Koopman-von Neumann theorem⁹. Note also that if $f$ is non-negative
with positive mean, then the almost periodic component $\mathbf{E}(f|\mathcal{Z}_1)$ will be also.
Combining this fact with the recurrence already obtained for almost periodic functions,
and the negligibility of weakly mixing functions, we obtain recurrence for all functions,
i.e. we have established the general $k = 3$ case of Furstenberg's multiple recurrence
theorem.

We now give the barest sketch of how things continue onward from here. For $k = 4$
one needs to define notions of almost periodicity and weak mixing of order 2. Of the
two, the latter is easier, because we can copy Lemma 5.2 and declare a function $f$ to be
weakly mixing of order 2 if its self-correlations $T^n f\, f$ are asymptotically weakly mixing
of order 1, thus

$$\lim_{N\to\infty} \mathbf{E}_{-N \le n \le N} \|\mathbf{E}(T^n f\, f\,|\,\mathcal{Z}_1)\|_{L^2}^2 = 0.$$

(Many other equivalent definitions are possible.) Repeated application of van der Corput
eventually shows that such functions are negligible for the averages

$$\lim_{N\to\infty} \mathbf{E}_{1 \le r \le N} \int_X f\, T^r g\, T^{2r} h\, T^{3r} k\,d\mu$$

in the sense that this average vanishes whenever f,g,h,k are bounded and at least one 
is weakly mixing of order 2. It is not hard to show that there exists a unique factor $\mathcal{Z}_2$
(that extends $\mathcal{Z}_1$) such that the weakly mixing functions of order 2 are precisely those
functions $f$ whose conditional expectation $\mathbf{E}(f|\mathcal{Z}_2)$ vanishes. (In the work of Host and
Kra, this factor $\mathcal{Z}_2$ is generated by nonconventional averages such as

$$\lim_{N\to\infty} \mathbf{E}_{-N \le a,b,c \le N}\, T^a f\, T^b f\, T^c f\, T^{a+b} f\, T^{b+c} f\, T^{a+c} f\, T^{a+b+c} f;$$

this idea was then adapted for the finite setting in [23] as the notion of a dual function
to construct a finitary analogue of this factor.)

⁹ Lemma 5.2 is also sometimes known as the Koopman-von Neumann theorem; the two facts are of
course closely related.

One would then like the almost periodic
functions of order 2 to be some dense subclass of $L^2(\mathcal{Z}_2)$. This can be done; the trick
is to repeat the original definition of almost periodic, but view terms such as "finite
dimensional" or "compact" not in terms of vector spaces over $\mathbf{R}$ (as we have implicitly
been doing), but rather¹⁰ as modules over the von Neumann algebra $L^\infty(\mathcal{Z}_1)$ of bounded
almost periodic functions. In particular:

• $f$ is an eigenfunction of order 2 (also known as a quadratic eigenfunction) if and
only if the orbit $\{T^n f : n \in \mathbf{Z}\}$ lives in a one-dimensional module over $L^\infty(\mathcal{Z}_1)$.

• $f$ is quasiperiodic of order 2 if and only if the orbit $\{T^n f : n \in \mathbf{Z}\}$ lives in a
finite-dimensional module over $L^\infty(\mathcal{Z}_1)$.

• $f$ is almost periodic of order 2 if and only if the orbit $\{T^n f : n \in \mathbf{Z}\}$ can be
"approximated to arbitrary accuracy" by subsets of finite-dimensional modules
over $L^\infty(\mathcal{Z}_1)$. (The precise definition is a little tricky and subtle; see [12].)

A quadratic eigenfunction can equivalently be defined (at least in the ergodic case)
as a function $f$ obeying an identity of the form $Tf = gf$, where $g$ is itself a linear
eigenfunction, thus $Tg = e^{2\pi i\theta} g$ for some $\theta \in \mathbf{R}/\mathbf{Z}$. The origin of the term "quadratic"
can then be observed from an inspection of the phase in the identity

$$T^n f = e^{2\pi i \frac{n(n-1)}{2}\theta}\, g^n f.$$

From the closely related identity

$$f\; T^n(\overline{f}^3)\; T^{2n}(f^3)\; T^{3n}\overline{f} = |f|^8$$

one also sees that quadratic eigenfunctions are not negligible for the purposes of triple
recurrence (indeed they end up being orthogonal to all quadratically mixing functions).
Quasiperiodic functions of order 2 are special cases of 2-step nilsequences, which will be
discussed in Bryna Kra's lectures. They can be viewed as components of vector-valued
(or matrix-valued) quadratic eigenfunctions, and arise from what are known as finite
rank extensions of the Kronecker factor $\mathcal{Z}_1$.
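
The two identities for quadratic eigenfunctions can be checked numerically on a concrete example. The sketch below is only an illustration under stated assumptions: it takes the skew shift $T(x,y) = (x+\theta, y+x)$ on the torus $(\mathbf{R}/\mathbf{Z})^2$, the functions $f(x,y) = e^{2\pi i y}$ and $g(x,y) = e^{2\pi i x}$, and the convention $(Tf)(p) = f(Tp)$. None of these choices, nor the names `THETA`, `shift`, `phase_identity_error` and `recurrence_identity_error`, come from the notes themselves.

```python
import cmath
import math

THETA = math.sqrt(2) - 1  # assumed irrational rotation number (illustrative choice)


def T(point):
    """Skew shift on the 2-torus: (x, y) -> (x + theta, y + x) mod 1."""
    x, y = point
    return ((x + THETA) % 1.0, (y + x) % 1.0)


def e(t):
    """The character t -> exp(2 pi i t)."""
    return cmath.exp(2j * math.pi * t)


def f(point):
    """Candidate quadratic eigenfunction: Tf = g f."""
    return e(point[1])


def g(point):
    """Linear eigenfunction: Tg = e(theta) g."""
    return e(point[0])


def shift(func, n):
    """Return T^n func, using the assumed convention (T func)(p) = func(T p)."""
    def shifted(point):
        for _ in range(n):
            point = T(point)
        return func(point)
    return shifted


def phase_identity_error(point, n):
    """|T^n f - e(n(n-1) theta / 2) g^n f| evaluated at one point."""
    lhs = shift(f, n)(point)
    rhs = e(n * (n - 1) * THETA / 2) * g(point) ** n * f(point)
    return abs(lhs - rhs)


def recurrence_identity_error(point, n):
    """|f T^n(conj(f)^3) T^{2n}(f^3) T^{3n}(conj f) - |f|^8| at one point."""
    fbar3 = lambda p: f(p).conjugate() ** 3
    f3 = lambda p: f(p) ** 3
    fbar = lambda p: f(p).conjugate()
    lhs = (f(point) * shift(fbar3, n)(point)
           * shift(f3, 2 * n)(point) * shift(fbar, 3 * n)(point))
    return abs(lhs - abs(f(point)) ** 8)
```

Both error functions should return values at the level of floating-point rounding for every starting point and every $n \ge 0$, reflecting the exact cancellation of the quadratic phases.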

At any rate, the almost periodic functions of order 2 now form a dense subclass
of $L^2(\mathcal{Z}_2)$, and are an algebra, and so one can repeat previous arguments and reduce
the proof of the Furstenberg recurrence theorem for $k = 4$ to the task of proving
such recurrence for such quadratically almost periodic functions. This turns out to
be complicated - in part because this result includes Proposition 11.51 as a special case 
(the case of quadratic eigenfunctions), and this proposition is itself not entirely trivial 
(requiring at a bare minimum some form of van der Waerden's theorem). Fortunately,
the colouring argument given previously for almost periodic functions - which does use
van der Waerden's theorem - extends (after nontrivial effort) to this case, and more
generally to all orders, thus leading to a proof of the Furstenberg recurrence theorem.
See [TT], [TJ], [12], as well as Bryna Kra's lectures.

¹⁰ The combinatorial analogue of this would be to partition the original space $X$ into atoms - in
this case, the atoms of $\mathcal{Z}_1$ - and somehow work on each atom separately. Of course, things are not
this simple because the atoms are usually not shift-invariant and so the shift structure is now more
complicated, passing from one atom to the next. The graph theoretic approach, which we will discuss
later, also relies heavily on restriction to atoms, but can cope with this with much greater ease because
this approach "forgets" all the arithmetic structure and so there is nothing to destroy when passing to
an atom.

6. The graph theoretic approach 

Now we leave ergodic theory and turn to what (at first glance) appears to be a com- 
pletely different approach to Szemeredi's theorem, though at a deeper inspection one 
will find many themes in common. In the ergodic approach, it was the shift operator T 
which was the primary focus of investigation; the underlying set A of integers merely 
provided some probability measure for T to leave invariant. We have seen that the 
dynamical approach focuses almost entirely on the shift operator. In marked contrast, 
the hypergraph approach discards the shift structure completely; instead, it views the 
problem of finding an arithmetic progression as that of solving a set of simultaneous 
relations; these relations initially have some additive structure, but this structure is 
soon discarded, as these relations are soon modeled abstractly by graphs and hyper- 
graphs. With the forgetting of so much structure it is remarkable that any nontrivial 
progress can still be made; however there turn out to be deep theorems in (hyper)graph 
theory, comparable (though not directly equivalent) to the deep recurrence theorems in 
topological dynamics and ergodic theory, which allow one to proceed even after losing 
almost all of the arithmetic structure. It is a fascinating question as to what the "true" 
origin of these deep facts is - it seems to be some very abstract and general dichotomy
between randomness and structure - and how they may be united with the ergodic and 
Fourier-analytic approaches. 

To illustrate the power of the graph theoretic approach, let us prove a theorem which 
looks similar to van der Waerden's theorem though it is slightly different. 

Theorem 6.1 (Schur's theorem). Suppose the positive integers $\mathbf{Z}^+$ are finitely coloured.
Then one of the colour classes contains a triple of the form $\{x, y, x + y\}$.

Proof. Our task is to find $x, y > 0$ and a colour class $C$ for which we have the simultaneous
relations

$$x \in C, \qquad y \in C, \qquad x + y \in C.$$

The problem is that these equations (three relations in two unknowns) are coupled
together in an unpleasant way. However we can decouple things slightly by making the
(somewhat underdetermined) substitution $x = b - a$, $y = c - b$ for some $a < b < c$;
our task is then to find such $a < b < c$ and a colour class $C$ for which we have the
simultaneous relations

$$b - a \in C, \qquad c - b \in C, \qquad c - a \in C.$$

Now we have three relations in three unknowns, which is a bit better for the purposes
of finding solutions. Furthermore, the relations are more symmetric in $a, b, c$, and each
relation only involves two of the three unknowns. This is all that we will need to
proceed. Indeed, let us now edge-colour the complete graph on the natural numbers by
assigning to each edge $(a, b)$ with $b > a$ the colour of $b - a$ in the original colouring
(this is known as the Cayley graph associated to the original colouring). A solution to
the above simultaneous relations is now nothing more than a monochromatic triangle
in this graph. But the existence of such a triangle follows immediately from Ramsey's
theorem. (Indeed, when only two colours are used, one sees that one can even take $a, b, c$ to be no larger than 6!) □
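
The reduction from Schur triples to monochromatic triangles in the Cayley graph is easy to test by brute force for two colours. The following is a minimal sketch, not part of the original argument; the function names are illustrative. It enumerates every 2-colouring of the differences $1, \ldots, n-1$ and looks for $a < b < c \le n$ with $b-a$, $c-b$, $c-a$ of the same colour.

```python
from itertools import combinations, product


def has_monochromatic_triangle(colouring, n):
    """colouring maps each difference d in 1..n-1 to a colour.

    The vertices are 1..n and the edge (a, b) with a < b receives the colour
    of b - a, so a monochromatic triangle a < b < c is exactly a Schur triple
    x = b - a, y = c - b, x + y = c - a inside one colour class.
    """
    for a, b, c in combinations(range(1, n + 1), 3):
        if colouring[b - a] == colouring[c - b] == colouring[c - a]:
            return True
    return False


def every_colouring_has_schur_triple(n, colours=2):
    """Check all colourings of {1, ..., n-1} with the given number of colours."""
    for assignment in product(range(colours), repeat=n - 1):
        colouring = dict(zip(range(1, n), assignment))
        if not has_monochromatic_triangle(colouring, n):
            return False
    return True
```

With six vertices every 2-colouring produces a triangle, while with five vertices the colouring with classes $\{1, 4\}$ and $\{2, 3\}$ avoids one, so the bound of 6 cannot be lowered for two colours.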

Note that we only used a very special case of Ramsey's theorem; using the full version 
of Ramsey's theorem leads to substantial generalisation of Schur's theorem, especially 
when combined with van der Waerden's theorem, known as Rado's theorem; see for 
instance [17].

Now we see what can similarly be done for progressions of length three in a set $A$ of
integers. Actually it will be convenient to localise to a cyclic group $\mathbf{Z}/N\mathbf{Z}$ and prove
the following.

Theorem 6.2 (Roth's theorem, cyclic group version). Let $N$ be a large integer, and
let $A \subset \mathbf{Z}/N\mathbf{Z}$ be such that $|A| \ge \delta N$. Then there are at least $c(\delta)N^2$ progressions
$x, x + r, x + 2r$ in $A$ for some $c(\delta) > 0$ (we allow $r$ to be zero).

It is easy to see that this implies the $k = 3$ version of Szemeredi's theorem (and is
in fact equivalent to it, thanks to the formulation in Theorem 4.8). Our task is to find
many solutions to the system of relations

$$n \in A, \qquad n + r \in A, \qquad n + 2r \in A.$$

Again this is three equations in two unknowns. We add an unknown by making the
underdetermined substitution $n := -x_2 - 2x_3$, $r := x_1 + x_2 + x_3$ and obtain the system

$$\begin{aligned}
-x_2 - 2x_3 &\in A \\
x_1 - x_3 &\in A \\
2x_1 + x_2 &\in A.
\end{aligned}$$



This is again three relations in three unknowns, where each relation involves only two of
the three variables; our task is to locate $c(\delta)N^3$ solutions. The situation is not quite the
same as with Schur's theorem, though; for instance, the three relations are not entirely
symmetric. On the other hand, we already know a lot of degenerate solutions to this
system:

$$\begin{aligned}
-x_2 - 2x_3 &\in A \\
x_1 - x_3 &\in A \\
2x_1 + x_2 &\in A \\
x_1 + x_2 + x_3 &= 0.
\end{aligned}$$

Indeed, every element of $A$ generates $N$ such solutions, so we have at least $\delta N^2$ solutions in all.
We can rephrase this as a conditional probability bound

$$\mathbf{P}(-x_2 - 2x_3,\ x_1 - x_3,\ 2x_1 + x_2 \in A \mid x_1 + x_2 + x_3 = 0) \ge \delta \qquad (6.1)$$

where we think of $x_1, x_2, x_3$ as ranging freely over the cyclic group $\mathbf{Z}/N\mathbf{Z}$, and then
conditioned so that $x_1 + x_2 + x_3 = 0$. Our goal seems innocuous, namely to remove this
conditional expectation and conclude that

$$\mathbf{P}(-x_2 - 2x_3,\ x_1 - x_3,\ 2x_1 + x_2 \in A) \ge c(\delta). \qquad (6.2)$$

This is less trivial than it first appears. The problem is that the event $x_1 + x_2 + x_3 = 0$
has tiny probability - $1/N$ - and so we only get a tiny lower bound of $\delta/N$ if we
naively apply Bayes' identity. (This corresponds to the fact that the number of trivial
progressions - $\delta N^2$ - is negligible compared with the number of progressions that
we actually want, which is $c(\delta)N^3$.) However, the point will be that the solution
set $\{(x_1, x_2, x_3) : -x_2 - 2x_3,\ x_1 - x_3,\ 2x_1 + x_2 \in A\}$, being the intersection of three
"second-order" sets $\{(x_1,x_2,x_3) : -x_2 - 2x_3 \in A\}$, $\{(x_1,x_2,x_3) : x_1 - x_3 \in A\}$,
$\{(x_1,x_2,x_3) : 2x_1 + x_2 \in A\}$, is not a completely arbitrary set, and as it turns out it
cannot concentrate itself entirely on the "third-order set" $\{(x_1, x_2, x_3) : x_1 + x_2 + x_3 = 0\}$.
For instance, observe that given any relation $x_i \sim x_j$ involving just two of the $x_1, x_2, x_3$,
we have

$$\mathbf{P}(x_i \sim x_j \mid x_1 + x_2 + x_3 = 0) = \mathbf{P}(x_i \sim x_j) \qquad (6.3)$$

or given any sets $A_1, A_2$, we have

$$\mathbf{P}(x_1 \in A_1, x_2 \in A_2 \mid x_1 + x_2 = 0) \le \min(\mathbf{P}(x_1 \in A_1), \mathbf{P}(x_2 \in A_2)) \le \mathbf{P}(x_1 \in A_1, x_2 \in A_2)^{1/2}.$$

So we see that when the structure of the set is sufficiently "low order" , one can remove 
the conditional expectation. Can one do so here? The answer is yes, and it relies on 
the following abstract result. 

Lemma 6.3 (Triangle removal lemma). [HE] Let $G$ be a graph on $n$ vertices that contains
fewer than $\varepsilon n^3$ triangles for some $0 < \varepsilon < 1$. Then it is possible to delete $o_{\varepsilon \to 0}(n^2)$ edges
from $G$ to create a triangle-free graph $G'$.

As usual we use $o_{\varepsilon \to 0}(X)$ to denote a quantity which is bounded by $c(\varepsilon)X$ for some
function $c(\varepsilon)$ of $\varepsilon$ which goes to zero as $\varepsilon \to 0$. Later on we will allow the decay rate
to depend on additional parameters, for instance $o_{\varepsilon \to 0;k}(1)$ would be a quantity which
decayed to zero as $\varepsilon \to 0$ for each fixed $k$, but which need not decay uniformly in $k$. An
equivalent formulation of this lemma is:

Lemma 6.4 (Triangle removal lemma, again). Let $G$ be a graph on $n$ vertices that
contains at least $\delta n^2$ edge-disjoint triangles for some $0 < \delta < 1$. Then it must in fact
contain $c(\delta)n^3$ triangles, where $c(\delta) > 0$ depends only on $\delta$.

We leave the equivalence of these two formulations to the reader. From the second
formulation it is an easy matter to deduce (6.2) from (6.1), by considering the tripartite
graph formed by three copies of $V := \mathbf{Z}/N\mathbf{Z}$ (corresponding to $x_1, x_2, x_3$ respectively), and with
the three edge classes between these copies defined by the relations $-x_2 - 2x_3 \in A$,
$x_1 - x_3 \in A$, and $2x_1 + x_2 \in A$ respectively; again, we leave this as an exercise for the
reader.
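
For small $N$ the tripartite graph in this exercise can be examined directly. The sketch below is illustrative only, with hypothetical function names. It counts all triangles (solutions of the three relations), counts the degenerate ones with $x_1 + x_2 + x_3 = 0$, and records the edges they use, so that one can confirm that the degenerate triangles are edge-disjoint and that there are exactly $N|A|$ of them.

```python
from itertools import product


def count_progressions(A, N):
    """Number of pairs (n, r) in (Z/NZ)^2 with n, n + r, n + 2r all in A."""
    A = {a % N for a in A}
    return sum(1 for n, r in product(range(N), repeat=2)
               if n in A and (n + r) % N in A and (n + 2 * r) % N in A)


def triangle_statistics(A, N):
    """Triangles in the tripartite graph on three copies of Z/NZ.

    Returns (total, degenerate, degenerate_edges), where degenerate triangles
    are those with x1 + x2 + x3 = 0 and degenerate_edges is the number of
    distinct edges used by them.
    """
    A = {a % N for a in A}
    total = degenerate = 0
    edges = set()
    for x1, x2, x3 in product(range(N), repeat=3):
        if ((-x2 - 2 * x3) % N in A
                and (x1 - x3) % N in A
                and (2 * x1 + x2) % N in A):
            total += 1
            if (x1 + x2 + x3) % N == 0:
                degenerate += 1
                # Tag each edge with the pair of vertex classes it joins.
                edges.update({("23", x2, x3), ("13", x1, x3), ("12", x1, x2)})
    return total, degenerate, len(edges)
```

Each triangle corresponds to a pair $(n, r)$ counted $N$ times, since the substitution $(x_1, x_2, x_3) \mapsto (n, r)$ is $N$-to-one; and the degenerate triangles use $3N|A|$ distinct edges, which is what edge-disjointness means here.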

There is another way to phrase this lemma in a "several variable measure theory" 
language that brings it more into line with the ergodic theory approach (and also the 
Fourier-analytic approach) . 

Lemma 6.5 (Triangle removal lemma, several variable version). Let $(X, \mu_X)$, $(Y, \mu_Y)$,
$(Z, \mu_Z)$ be probability spaces, and let $f : X \times Y \to [0, 1]$, $g : Y \times Z \to [0, 1]$, and
$h : Z \times X \to [0, 1]$ be measurable functions such that

$$\Lambda_3(f, g, h) \le \varepsilon$$

for some $0 < \varepsilon < 1$, where $\Lambda_3$ is the trilinear form

$$\Lambda_3(f, g, h) := \int_X \int_Y \int_Z f(x,y)\, g(y,z)\, h(z,x)\ d\mu_X(x)\, d\mu_Y(y)\, d\mu_Z(z).$$

Then there exist functions $\tilde f : X \times Y \to [0,1]$, $\tilde g : Y \times Z \to [0,1]$, and $\tilde h : Z \times X \to [0,1]$
which differ from $f, g, h$ in $L^1$ norm by $o_{\varepsilon \to 0}(1)$, thus

$$\int_X\int_Y |\tilde f(x,y) - f(x,y)|\ d\mu_X(x)\, d\mu_Y(y),\quad \int_Y\int_Z |\tilde g(y,z) - g(y,z)|\ d\mu_Y(y)\, d\mu_Z(z),\quad \int_Z\int_X |\tilde h(z,x) - h(z,x)|\ d\mu_Z(z)\, d\mu_X(x) = o_{\varepsilon \to 0}(1),$$

and such that $\tilde f(x,y)\,\tilde g(y,z)\,\tilde h(z,x)$ vanishes identically (in particular, $\Lambda_3(\tilde f, \tilde g, \tilde h) = 0$).

One can easily deduce Lemma 6.3 from Lemma 6.5 by specialising $X, Y, Z$ to be the
finite vertex set $V$ with the uniform probability measure, and letting $f = g = h$ be the
indicator function of the edge set of the graph $G$; we omit the details. The converse
implication is also true but somewhat tricky (one must discretise the measure spaces 
X,Y,Z, and split the atoms of such spaces to approximate the probability measures 
by uniform distributions, and also replace the functions f,g, h by indicator functions); 
we again omit the details. We will choose to work with the analytic formulation of the 
triangle removal lemma in these notes because it seems to extend more easily to the 
hypergraph setting (in which one considers similar expressions in more variables, where 
now each function can depend on three or more variables). 
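
In the finite setting just described the form $\Lambda_3$ is a normalised triangle count. The following sketch, an illustration rather than anything taken from the notes, evaluates $\Lambda_3$ for matrices on finite sets with uniform probability measure; `lambda3` and the example graphs are hypothetical names introduced here.

```python
def lambda3(f, g, h):
    """Trilinear form Lambda_3 for finite sets with uniform measure.

    f is an |X| x |Y| matrix, g is |Y| x |Z| and h is |Z| x |X|, all given as
    nested lists. The result is the average of f(x,y) g(y,z) h(z,x).
    """
    nx, ny, nz = len(f), len(g), len(h)
    total = 0.0
    for x in range(nx):
        for y in range(ny):
            if f[x][y] == 0:
                continue  # this (x, y) contributes nothing
            for z in range(nz):
                total += f[x][y] * g[y][z] * h[z][x]
    return total / (nx * ny * nz)


def complete_graph(n):
    """Adjacency matrix of K_n (no loops)."""
    return [[1 if i != j else 0 for j in range(n)] for i in range(n)]


def cycle_graph(n):
    """Adjacency matrix of the cycle C_n."""
    return [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)]
            for i in range(n)]
```

When $f = g = h$ is the adjacency matrix of a graph on $n$ vertices, `lambda3` returns six times the number of triangles divided by $n^3$, because each triangle is counted once for every ordering of its vertices. A triangle-free graph such as the 5-cycle therefore gives zero.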

Lemma 6.5 asserts, roughly speaking, that if a collection of low complexity functions
have a small product, then one can "clean" each function slightly in a low-complexity
manner in order to make the product vanish entirely. Note that the claim would be
trivial if one were allowed to modify (say) $f$ in a manner which could depend on all
three variables $x, y, z$. The power of the lemma lies in the fact that the high-complexity
expression $\Lambda_3(f, g, h)$ can be manipulated purely in terms of low-complexity operations.
This rather deep phenomenon seems to be rather general; in fact there is a similar 
lemma for any non-negative combination of functions of various collections of variables 
(we shall describe one such version a little later below). It is however still not perfectly 
well understood. 

The way one proves Lemma 6.5 is by decomposing $f, g, h$ into "structured" or "low
complexity" components, which are easier to clean up, and "error terms", which for
one reason or another do not interfere with the cleaning process because they give a
negligible contribution to expressions such as $\Lambda_3(f, g, h)$. It turns out that there are
two types of error terms which come into play. The first are errors which are "small"
in an integral sense, say in $L^2$ norm, while the second are errors which are (very) small
in a weak sense (for instance, they are small when tested against other functions which
depend on other sets of variables). The latter will be encoded using a useful norm, the
Gowers $\Box^2$ norm $\|f\|_{\Box^2(X \times Y)} = \|f\|_{\Box^2}$, defined for measurable bounded $f : X \times Y \to \mathbf{R}$
by the formula

$$\|f\|_{\Box^2(X \times Y)}^4 := \int_X \int_X \int_Y \int_Y f(x,y)\, f(x,y')\, f(x',y)\, f(x',y')\ d\mu_X(x)\, d\mu_X(x')\, d\mu_Y(y)\, d\mu_Y(y').$$
One easily verifies that the right-hand side is non-negative. From two applications of
the Cauchy-Schwarz inequality one verifies the Gowers-Cauchy-Schwarz inequality

$$\Big| \int_X \int_X \int_Y \int_Y f_{00}(x,y)\, f_{01}(x,y')\, f_{10}(x',y)\, f_{11}(x',y')\ d\mu_X(x)\, d\mu_X(x')\, d\mu_Y(y)\, d\mu_Y(y') \Big| \le \|f_{00}\|_{\Box^2} \|f_{01}\|_{\Box^2} \|f_{10}\|_{\Box^2} \|f_{11}\|_{\Box^2} \qquad (6.4)$$

from which one readily verifies that $\Box^2$ obeys the triangle inequality and is thus at least
a seminorm. From the Gowers-Cauchy-Schwarz inequality (and bounding the $\Box^2$ norm
crudely by the $L^\infty$ norm) one also sees that

$$\Big| \int_X \int_Y f(x,y)\, g(y)\, h(x)\ d\mu_X(x)\, d\mu_Y(y) \Big| \le \|f\|_{\Box^2} \qquad (6.5)$$

whenever $g, h$ are measurable functions bounded in magnitude by 1; this in particular
shows that if $\|f\|_{\Box^2} = 0$ then $f$ is zero almost everywhere. Thus the $\Box^2$ norm is indeed
a norm¹¹, after the customary convention of identifying two functions that agree almost
everywhere. Letting $g, h$ depend on a third variable $z$ in (6.5) and integrating in $z$, and
using symmetry, we thus conclude the generalised von Neumann inequality

$$|\Lambda_3(f, g, h)| \le \min(\|f\|_{\Box^2}, \|g\|_{\Box^2}, \|h\|_{\Box^2}) \qquad (6.6)$$

whenever $f : X \times Y \to [-1, 1]$, $g : Y \times Z \to [-1, 1]$, $h : Z \times X \to [-1, 1]$ are measurable.
Thus functions with tiny $\Box^2$ norm have a negligible impact on the $\Lambda_3$ form; such
functions are known as pseudorandom or Gowers uniform. To exploit this, one would
now like to decompose arbitrary functions $f : X \times Y \to [0, 1]$ into a "structured"
component which can be easily analysed and manipulated, plus errors which are small
in $\Box^2$ or are otherwise easy to deal with. The first key observation is

Lemma 6.6 (Lack of uniformity implies correlation with structure). Let $f : X \times Y \to
[-1, 1]$ be such that $\|f\|_{\Box^2} \ge \eta$ for some $\eta > 0$. Then there exists $A \subset X$ and $B \subset Y$
such that

$$\Big| \int_X \int_Y 1_A(x)\, 1_B(y)\, f(x,y)\ d\mu_X(x)\, d\mu_Y(y) \Big| \ge \eta^4/4.$$

Proof. By definition of the $\Box^2$ norm we have

$$\int_X \int_Y \int_X \int_Y f(x,y)\, f(x,y')\, f(x',y)\, f(x',y')\ d\mu_X(x)\, d\mu_Y(y)\, d\mu_X(x')\, d\mu_Y(y') \ge \eta^4.$$

By the pigeonhole principle and the boundedness of $f$, we can thus find $x', y'$ such that

$$\Big| \int_X \int_Y f(x,y)\, f(x,y')\, f(x',y)\ d\mu_X(x)\, d\mu_Y(y) \Big| \ge \eta^4.$$

We rewrite this using Fubini's theorem as

$$\Big| \int_{-1}^{1} \int_{-1}^{1} \operatorname{sgn}(s)\operatorname{sgn}(t) \int_X \int_Y 1_{A_s}(x)\, 1_{B_t}(y)\, f(x,y)\ d\mu_X(x)\, d\mu_Y(y)\, ds\, dt \Big| \ge \eta^4$$

where $A_s := \{x \in X : \operatorname{sgn}(s) f(x, y') \ge |s|\}$ and $B_t := \{y \in Y : \operatorname{sgn}(t) f(x', y) \ge |t|\}$.
The claim then follows from another application of the pigeonhole principle. □

¹¹ One can also identify the $\Box^2$ norm with the Schatten-von Neumann 4-norm of the integral operator
with kernel $f(x,y)$; in the important special case when $X$ is a finite set with the uniform distribution,
and $f$ is symmetric, then the $\Box^2$ norm is simply the $\ell^4$ norm of the eigenvalues of the matrix associated
to $f$. If $f$ is the indicator function of a graph $G$, the $\Box^2$ norm is a normalised count of the number
of 4-cycles in $G$. However we will not take advantage of these facts as they do not generalise well to
hypergraph situations.
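
On finite sets with uniform measure, both inequality (6.5) and Lemma 6.6 can be tested by direct computation. The sketch below is only an illustration: the helper names are hypothetical, and the proof's construction of $A_s, B_t$ is replaced by an exhaustive search over all subsets, which is feasible only for very small matrices.

```python
import random
from itertools import product


def box_norm(f):
    """Gowers box norm of a matrix f (rows indexed by X, columns by Y)."""
    nx, ny = len(f), len(f[0])
    total = 0.0
    for x, xp in product(range(nx), repeat=2):
        inner = sum(f[x][y] * f[xp][y] for y in range(ny))
        total += inner * inner  # a sum of squares, hence non-negative
    return (total / (nx * nx * ny * ny)) ** 0.25


def pairing(f, g, h):
    """Left-hand side of (6.5): |average over x, y of f(x,y) g(y) h(x)|."""
    nx, ny = len(f), len(f[0])
    s = sum(f[x][y] * g[y] * h[x] for x in range(nx) for y in range(ny))
    return abs(s) / (nx * ny)


def best_rectangle(f):
    """Maximum over subsets A of X, B of Y of |average of 1_A 1_B f|."""
    nx, ny = len(f), len(f[0])
    best = 0.0
    for A in product((0, 1), repeat=nx):
        for B in product((0, 1), repeat=ny):
            best = max(best, pairing(f, B, A))
    return best


def random_matrix(nx, ny, seed):
    """Matrix with entries in [-1, 1], reproducible from the seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(ny)] for _ in range(nx)]
```

For any matrix `f` with entries in $[-1,1]$ one should find `pairing(f, g, h) <= box_norm(f)` for all `g`, `h` bounded by 1, and `best_rectangle(f) >= box_norm(f) ** 4 / 4`. Writing the fourth power of the norm as a sum of squares also makes its non-negativity evident.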

To exploit this we borrow some notation from the ergodic theory approach, namely
that of $\sigma$-algebras and conditional expectation. However, in this simple context we will
only need to deal with finite $\sigma$-algebras. If $\mathcal{B}$ is a finite factor of $X$ (i.e. a finite $\sigma$-algebra
of measurable sets in $X$), then $\mathcal{B}$ is essentially just a partition of $X$ into finitely many
disjoint atoms $A_1, \ldots, A_M$ (more precisely, $\mathcal{B}$ is the $\sigma$-algebra consisting of all finite
unions of these atoms). If $f : X \to \mathbf{R}$ is measurable, then the conditional expectation
$\mathbf{E}(f|\mathcal{B}) : X \to \mathbf{R}$ is the function defined as $\mathbf{E}(f|\mathcal{B})(x) := \frac{1}{\mu_X(A_i)} \int_{A_i} f(x')\, d\mu_X(x')$ whenever
$x$ lies in an atom $A_i$ of positive measure. (Conditional expectations are only defined up
to sets of measure zero, so we can define $\mathbf{E}(f|\mathcal{B})$ arbitrarily on atoms of measure zero.)
We say that a factor has complexity at most $m$ if it is generated by at most $m$ sets (and
thus it contains at most $2^m$ atoms). If $\mathcal{B}_X$ is a finite factor of $X$ with atoms $A_1, \ldots, A_M$,
and $\mathcal{B}_Y$ is a finite factor of $Y$ with atoms $B_1, \ldots, B_N$, then $\mathcal{B}_X \vee \mathcal{B}_Y$ is a finite factor of
$X \times Y$ with atoms $A_i \times B_j$ for $1 \le i \le M$ and $1 \le j \le N$.
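
For finite sets with uniform measure, conditioning on $\mathcal{B}_X \vee \mathcal{B}_Y$ simply averages a matrix over each block $A_i \times B_j$. A minimal sketch follows, with hypothetical names and with atoms represented as lists of indices (an assumption of this example, not of the notes).

```python
def conditional_expectation(f, row_atoms, col_atoms):
    """E(f | B_X v B_Y) for a matrix f under uniform measure.

    row_atoms and col_atoms are partitions of the row and column indices,
    each given as a list of non-empty lists. Every entry of f is replaced by
    the average of f over the block A x B that contains it.
    """
    out = [[0.0] * len(f[0]) for _ in f]
    for A in row_atoms:
        for B in col_atoms:
            avg = sum(f[x][y] for x in A for y in B) / (len(A) * len(B))
            for x in A:
                for y in B:
                    out[x][y] = avg
    return out


def mean(f):
    """Average of all entries of a matrix."""
    return sum(v for row in f for v in row) / (len(f) * len(f[0]))
```

For instance, with `f = [[1, 0, 1, 1], [0, 1, 1, 1]]`, a single row atom `[0, 1]` and column atoms `[0, 1]` and `[2, 3]`, the left block is averaged to 0.5 and the right block stays at 1. Conditioning twice on the same factor changes nothing, and the overall mean is preserved.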

The key relationship between the D 2 norm and conditional expectation on finite 
factors is the following. 

Lemma 6.7 (Lack of uniformity implies energy increment). Let $\mathcal{B}_X, \mathcal{B}_Y$ be finite factors
of $X, Y$ respectively of complexity at most $m$, and let $f : X \times Y \to [0, 1]$ be such that

$$\|f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\|_{\Box^2(X \times Y)} \ge \eta$$

for some $\eta > 0$. Then there exist extensions $\mathcal{B}'_X, \mathcal{B}'_Y$ of $\mathcal{B}_X, \mathcal{B}_Y$ of complexity at most
$m + 1$ such that

$$\|\mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y)\|_{L^2(X \times Y)}^2 \ge \|\mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\|_{L^2(X \times Y)}^2 + \eta^8/16.$$

Here of course $\|F\|_{L^2(X \times Y)}^2 := \int_X \int_Y |F(x,y)|^2\ d\mu_X(x)\, d\mu_Y(y)$.

The key point here is that $f$ - which is a "second-order" object, depending on two
variables - is correlating with two "first-order" objects $\mathcal{B}'_X, \mathcal{B}'_Y$. This ultimately will
allow us to approximate the second-order object by a number of first-order objects. It
is this kind of reduction - in which a single high-order object is traded in for a large
number of lower-order objects - which is the key to proving results such as the triangle
removal lemma. The quantity $\|\mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\|_{L^2(X \times Y)}^2$ is known as the index of the
partition $\mathcal{B}_X \vee \mathcal{B}_Y$ in the graph theory literature; here we shall refer to it as the energy
of this partition.

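Proof. From Lemma 6.6 we can find measurable $A \subset X$, $B \subset Y$ such that

$$\Big| \int_X \int_Y 1_A(x)\, 1_B(y)\, \big(f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\big)(x,y)\ d\mu_X(x)\, d\mu_Y(y) \Big| \ge \eta^4/4.$$

Let $\mathcal{B}'_X$ be the factor of $X$ generated by $\mathcal{B}_X$ and $A$, and similarly let $\mathcal{B}'_Y$ be the factor
of $Y$ generated by $\mathcal{B}_Y$ and $B$; then $\mathcal{B}'_X, \mathcal{B}'_Y$ have complexity at most $m + 1$. Since
$1_A(x) 1_B(y)$ is $\mathcal{B}'_X \vee \mathcal{B}'_Y$-measurable, we have

$$\int_X \int_Y 1_A(x)\, 1_B(y)\, \big(f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\big)\ d\mu_X(x)\, d\mu_Y(y) = \int_X \int_Y 1_A(x)\, 1_B(y)\, \mathbf{E}\big(f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\,\big|\,\mathcal{B}'_X \vee \mathcal{B}'_Y\big)\ d\mu_X(x)\, d\mu_Y(y)$$

so by Cauchy-Schwarz

$$\big\|\mathbf{E}\big(f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\,\big|\,\mathcal{B}'_X \vee \mathcal{B}'_Y\big)\big\|_{L^2(X \times Y)} \ge \eta^4/4.$$

Now observe that the quantity

$$\mathbf{E}\big(f - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\,\big|\,\mathcal{B}'_X \vee \mathcal{B}'_Y\big) = \mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y) - \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)$$

is orthogonal to $\mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)$. The claim then follows from Pythagoras' theorem. □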

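Note that if $f$ is bounded by 1, then the quantity $\|\mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)\|_{L^2(X \times Y)}^2$ is bounded
between 0 and 1. Thus an easy iteration of the above lemma (which must stop after at most $16/\eta^8$ steps) gives

Corollary 6.8 (Koopman-von Neumann decomposition). Let $\mathcal{B}_X, \mathcal{B}_Y$ be finite factors
of $X, Y$ respectively of complexity at most $m$, let $f : X \times Y \to [0, 1]$ be measurable, and
let $\eta > 0$. Then there exist extensions $\mathcal{B}'_X, \mathcal{B}'_Y$ of $\mathcal{B}_X, \mathcal{B}_Y$ of complexity at most $m + 16/\eta^8$
such that

$$\|f - \mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y)\|_{\Box^2(X \times Y)} < \eta.$$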
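
The iteration behind Corollary 6.8 can be sketched for small matrices with uniform measure. This is an illustration under stated assumptions, not the notes' construction: the sets $A, B$ supplied by Lemma 6.6 are found here by exhaustive search (`best_rectangle`), and all function names are hypothetical.

```python
from itertools import product


def cond_exp(f, rows, cols):
    """Average f over each block A x B of the product partition."""
    out = [[0.0] * len(f[0]) for _ in f]
    for A, B in product(rows, cols):
        avg = sum(f[x][y] for x in A for y in B) / (len(A) * len(B))
        for x, y in product(A, B):
            out[x][y] = avg
    return out


def box_norm(f):
    """Gowers box norm of a matrix under uniform measure."""
    nx, ny = len(f), len(f[0])
    total = sum(sum(f[x][y] * f[xp][y] for y in range(ny)) ** 2
                for x, xp in product(range(nx), repeat=2))
    return (total / (nx * ny) ** 2) ** 0.25


def energy(F):
    """Squared L^2 norm of a matrix under uniform measure."""
    return sum(v * v for row in F for v in row) / (len(F) * len(F[0]))


def best_rectangle(f):
    """Exhaustive stand-in for Lemma 6.6: indicators A, B maximising |sum of f on A x B|."""
    nx, ny = len(f), len(f[0])
    best = (-1.0, None, None)
    for A in product((0, 1), repeat=nx):
        for B in product((0, 1), repeat=ny):
            s = abs(sum(f[x][y] for x in range(nx) if A[x]
                        for y in range(ny) if B[y]))
            if s > best[0]:
                best = (s, A, B)
    return best[1], best[2]


def refine(atoms, indicator):
    """Split every atom by membership in the new set, dropping empty pieces."""
    pieces = []
    for atom in atoms:
        inside = [i for i in atom if indicator[i]]
        outside = [i for i in atom if not indicator[i]]
        pieces += [p for p in (inside, outside) if p]
    return pieces


def koopman_von_neumann(f, eta):
    """Refine the trivial partitions until f - E(f|B) has box norm below eta."""
    nx, ny = len(f), len(f[0])
    rows, cols = [list(range(nx))], [list(range(ny))]
    energies = []
    while True:
        F = cond_exp(f, rows, cols)
        energies.append(energy(F))
        err = [[f[x][y] - F[x][y] for y in range(ny)] for x in range(nx)]
        if box_norm(err) < eta:
            return rows, cols, energies
        A, B = best_rectangle(err)
        rows, cols = refine(rows, A), refine(cols, B)
```

Each pass adds one generating set to each factor, and by Lemma 6.7 the recorded energies increase by at least $\eta^8/16$ per step, which is what forces the loop to stop.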

This corollary splits $f$ into a bounded complexity object $\mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y)$ and an error
which is small in the $\Box^2$ norm. In practice, this decomposition is not very useful
because the complexity of the structured component $\mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y)$ is large compared
to the bounds available on the error $f - \mathbf{E}(f|\mathcal{B}'_X \vee \mathcal{B}'_Y)$. However one can rectify this
by one further iteration of the above decomposition:

Lemma 6.9 (Szemeredi regularity lemma). Let $f : X \times Y \to [0, 1]$ be measurable, let
$\tau > 0$, and let $F : \mathbf{N} \to \mathbf{N}$ be an arbitrary increasing function (possibly depending on $\tau$).
Then there exists an integer $M = O_{F,\tau}(1)$ and a decomposition $f = f_1 + f_2 + f_3$ where

• ($f_1$ is structured) We have $f_1 = \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)$ for some finite factors $\mathcal{B}_X, \mathcal{B}_Y$ of
$X, Y$ respectively of complexity at most $M$;

• ($f_2$ is small) We have $\|f_2\|_{L^2(X \times Y)} \le \tau$.

• ($f_3$ is very uniform) We have $\|f_3\|_{\Box^2(X \times Y)} \le 1/F(M)$.

• (Positivity) $f_1$ and $f_1 + f_2$ take values in $[0, 1]$.

This lemma may not immediately resemble the usual Szemeredi regularity lemma for 
graphs, but it can easily be used to deduce that lemma. See jH]. One can obtain a 
result similar to this from spectral theory, by viewing $f$ as the kernel of an integral
operator and decomposing $f$ using the singular value decomposition of that operator,
with $f_1, f_2, f_3$ corresponding to the high, medium, and low singular values respectively.
However it then takes some effort to ensure that $f_1$ and $f_1 + f_2$ are non-negative.
See [24] for some related discussion. The more "ergodic" approach here, relying on
conditional expectation, gives worse quantitative bounds but does easily ensure the 
positivity property, which is crucial in many applications. 

Proof. Construct recursively a sequence of integers

$$0 = M_0 < M_1 < M_2 < \ldots$$

by setting $M_0 := 0$ and $M_i := M_{i-1} + 16F(M_{i-1})^8$ for $i \ge 1$. Then for each $i \ge 0$,
construct recursively factors $\mathcal{B}^i_X, \mathcal{B}^i_Y$ of $X, Y$ of complexity at most $M_i$ by setting $\mathcal{B}^0_X$ and
$\mathcal{B}^0_Y$ to be the trivial factors of complexity 0, and then applying Corollary 6.8 repeatedly
to let $\mathcal{B}^i_X, \mathcal{B}^i_Y$ be extensions of $\mathcal{B}^{i-1}_X, \mathcal{B}^{i-1}_Y$ such that

$$\|f - \mathbf{E}(f|\mathcal{B}^i_X \vee \mathcal{B}^i_Y)\|_{\Box^2(X \times Y)} < 1/F(M_{i-1}).$$

The energies $\|\mathbf{E}(f|\mathcal{B}^i_X \vee \mathcal{B}^i_Y)\|_{L^2(X \times Y)}^2$ are monotone increasing in $i$ by Pythagoras' theorem,
and are bounded between 0 and 1. Thus by the pigeonhole principle we can find
$1 \le i \le 1/\tau^2$ for which
[...] then we see that the claims are easily verified. □
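
The recursion $M_i := M_{i-1} + 16F(M_{i-1})^8$ from the proof grows extremely quickly even for a slowly growing $F$, which is one way to see why the bound $M = O_{F,\tau}(1)$ is so poor quantitatively. A small sketch follows; the growth function $F(n) = n + 1$ used in the example is an arbitrary assumption, and `complexity_sequence` is a hypothetical name.

```python
def complexity_sequence(F, steps):
    """Return [M_0, ..., M_steps] with M_0 = 0 and M_i = M_{i-1} + 16 F(M_{i-1})^8."""
    M = [0]
    for _ in range(steps):
        M.append(M[-1] + 16 * F(M[-1]) ** 8)
    return M


def digits(n):
    """Number of decimal digits of a positive integer."""
    return len(str(n))
```

For example, `complexity_sequence(lambda n: n + 1, 3)` uses exact integer arithmetic, so one can inspect how the number of digits of $M_i$ multiplies by roughly eight at every step.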

A slight modification of the above argument allows one to simultaneously regularise 
several functions at once using the same partition. More precisely, we have 

Lemma 6.10 (Simultaneous Szemeredi regularity lemma). Let $f : X \times Y \to [0, 1]$,
$g : Y \times Z \to [0, 1]$, $h : Z \times X \to [0, 1]$ be measurable, let $\tau > 0$, and let $F : \mathbf{N} \to \mathbf{N}$ be
an arbitrary increasing function (possibly depending on $\tau$). Then there exists an integer
$M = O_{F,\tau}(1)$, factors $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ of $X, Y, Z$ respectively of complexity at most $M$, and
decompositions $f = f_1 + f_2 + f_3$, $g = g_1 + g_2 + g_3$, $h = h_1 + h_2 + h_3$, where

• ($f_1, g_1, h_1$ are structured) We have $f_1 = \mathbf{E}(f|\mathcal{B}_X \vee \mathcal{B}_Y)$, $g_1 = \mathbf{E}(g|\mathcal{B}_Y \vee \mathcal{B}_Z)$, and
$h_1 = \mathbf{E}(h|\mathcal{B}_Z \vee \mathcal{B}_X)$.

• ($f_2, g_2, h_2$ are small) We have $\|f_2\|_{L^2(X \times Y)}, \|g_2\|_{L^2(Y \times Z)}, \|h_2\|_{L^2(Z \times X)} \le \tau$.

• ($f_3, g_3, h_3$ are very uniform) We have $\|f_3\|_{\Box^2(X \times Y)}, \|g_3\|_{\Box^2(Y \times Z)}, \|h_3\|_{\Box^2(Z \times X)} \le
1/F(M)$.

• (Positivity) $f_1, g_1, h_1$ and $f_1 + f_2$, $g_1 + g_2$, $h_1 + h_2$ take values in $[0, 1]$.

We leave the proof of this lemma as an exercise to the reader. With this lemma we
can now prove Lemma 6.5. Actually we shall prove a slightly stronger statement, which
provides more information about the functions $\tilde f, \tilde g, \tilde h$ involved.

Lemma 6.11 (Strong triangle removal lemma, several variable version). Let $(X, \mu_X)$,
$(Y, \mu_Y)$, $(Z, \mu_Z)$ be probability spaces, and let $f : X \times Y \to [0, 1]$, $g : Y \times Z \to [0, 1]$,
and $h : Z \times X \to [0, 1]$ be measurable functions such that $\Lambda_3(f, g, h) \le \varepsilon$ for some
$0 < \varepsilon < 1$. Then there exist factors $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ of $X, Y, Z$ respectively of complexity at
most $O_\varepsilon(1)$ and sets $E_{X,Y} \in \mathcal{B}_X \vee \mathcal{B}_Y$, $E_{Y,Z} \in \mathcal{B}_Y \vee \mathcal{B}_Z$, $E_{Z,X} \in \mathcal{B}_Z \vee \mathcal{B}_X$ respectively
with $1_{E_{X,Y}}(x, y)\, 1_{E_{Y,Z}}(y, z)\, 1_{E_{Z,X}}(z, x)$ vanishing identically, such that

$$\int_X \int_Y f(x,y)\, 1_{E_{X,Y}^c}(x,y)\ d\mu_X(x)\, d\mu_Y(y),\quad \int_Y \int_Z g(y,z)\, 1_{E_{Y,Z}^c}(y,z)\ d\mu_Y(y)\, d\mu_Z(z),\quad \int_Z \int_X h(z,x)\, 1_{E_{Z,X}^c}(z,x)\ d\mu_Z(z)\, d\mu_X(x) \le o_{\varepsilon \to 0}(1).$$

Note that Lemma 6.11 immediately implies Lemma 6.5 by setting $\tilde f := f 1_{E_{X,Y}}$, etc.
This strengthened version of the lemma will come in handy in the next section.
Proof. We apply Lemma 6.10 with $0 < \tau < 1$ and $F$ to be chosen later; for now, one
should think of $\tau$ as being moderately small, but not very small compared to $\varepsilon$, and
similarly $F$ will be a moderately growing function. This gives us an integer $M = O_{F,\tau}(1)$,
factors $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ of complexity at most $M$, and decompositions $f = f_1 + f_2 + f_3$, etc.
with the stated properties. In particular

\[ \Lambda_3(f_1 + f_2 + f_3,\ g_1 + g_2 + g_3,\ h_1 + h_2 + h_3) \leq \varepsilon. \]

The idea shall be to eliminate the uniform errors $f_3, g_3, h_3$, and then the small errors
$f_2, g_2, h_2$, leaving one with only the structured components $f_1, g_1, h_1$, which will be easy
to deal with directly.

It is easy to eliminate $f_3, g_3, h_3$. Indeed from repeated application of the generalised
von Neumann inequality (6.6) and the $\Box^2$ bounds on $f_3, g_3, h_3$ we have

\[ \Lambda_3(f_1 + f_2,\ g_1 + g_2,\ h_1 + h_2) \leq \varepsilon + O(1/F(M)). \tag{6.7} \]
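
The mechanism behind this step can be checked numerically on a small example. The sketch below assumes finite sets $X = Y = Z = \{0,\ldots,n-1\}$ with the uniform measure and takes the generalised von Neumann inequality in the form $|\Lambda_3(f,g,h)| \leq \min(\|f\|_{\Box^2}, \|g\|_{\Box^2}, \|h\|_{\Box^2})$ for functions bounded by 1 in magnitude; the helper names and the random test data are illustrative only.

```python
import itertools
import random

def lambda3(f, g, h, n):
    """Trilinear form: average of f(x,y) g(y,z) h(z,x) over x, y, z in range(n)."""
    total = sum(f[x][y] * g[y][z] * h[z][x]
                for x, y, z in itertools.product(range(n), repeat=3))
    return total / n**3

def box2_norm(f, n):
    """Box norm: fourth root of the average of f(x,y) f(x,y') f(x',y) f(x',y')."""
    # Summing over y first writes the average as a sum of squares, so it is >= 0.
    total = 0.0
    for x, x2 in itertools.product(range(n), repeat=2):
        inner = sum(f[x][y] * f[x2][y] for y in range(n))
        total += inner * inner
    return (total / n**4) ** 0.25

rng = random.Random(0)          # fixed seed, purely for reproducibility
n = 6

def random_function():
    """A hypothetical test function with values in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]

f, g, h = random_function(), random_function(), random_function()
lhs = abs(lambda3(f, g, h, n))
rhs = min(box2_norm(f, n), box2_norm(g, n), box2_norm(h, n))
```

Here `lhs` never exceeds `rhs`, which is why a component with tiny $\Box^2$ norm can be dropped from $\Lambda_3$ at a cost comparable to that norm.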

We would now like to similarly eliminate $f_2, g_2, h_2$. A naive application of the $L^2$ bounds
would give an estimate of the form

\[ \Lambda_3(f_1, g_1, h_1) \leq \varepsilon + O(\tau) + O(1/F(M)) \tag{6.8} \]

but the $O(\tau)$ error turns out to be far too expensive for our purposes. Instead we
proceed in a more "local" fashion as follows. Let $E^0_{X,Y} \in \mathcal{B}_X \vee \mathcal{B}_Y$ be the set

\[ E^0_{X,Y} := \{ (x,y) \in X \times Y : f_1(x,y) \geq \tau^{1/10} \text{ and } \mathbf{E}(f_2(x,y)^2 | \mathcal{B}_X \vee \mathcal{B}_Y) \leq \tau \} \]

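and define $E^0_{Y,Z} \in \mathcal{B}_Y \vee \mathcal{B}_Z$ and $E^0_{Z,X} \in \mathcal{B}_Z \vee \mathcal{B}_X$ similarly. We first observe that $f$ is
small outside of $E^0_{X,Y}$. Indeed we have (by the $\mathcal{B}_X \vee \mathcal{B}_Y$-measurability of $E^0_{X,Y}$)

\begin{align*}
\int_X \int_Y f(x,y) 1_{(E^0_{X,Y})^c}(x,y)\, d\mu_X(x) d\mu_Y(y)
  &= \int_X \int_Y f_1(x,y) 1_{(E^0_{X,Y})^c}(x,y)\, d\mu_X(x) d\mu_Y(y) \\
  &\leq \int\!\!\int_{f_1(x,y) < \tau^{1/10}} f_1(x,y)\, d\mu_X(x) d\mu_Y(y)
      + \int\!\!\int_{\mathbf{E}(f_2(x,y)^2 | \mathcal{B}_X \vee \mathcal{B}_Y) > \tau} d\mu_X(x) d\mu_Y(y) \\
  &\leq \tau^{1/10} + \frac{1}{\tau} \int_X \int_Y \mathbf{E}(f_2(x,y)^2 | \mathcal{B}_X \vee \mathcal{B}_Y)\, d\mu_X(x) d\mu_Y(y) \\
  &= \tau^{1/10} + \frac{1}{\tau} \|f_2\|_{L^2(X \times Y)}^2 \\
  &= o_{\tau \to 0}(1).
\end{align*}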

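Let $A, B, C$ be atoms in $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ respectively such that $A \times B \subset E^0_{X,Y}$, $B \times C \subset E^0_{Y,Z}$,
and $C \times A \subset E^0_{Z,X}$, and consider the local quantity

\[ \Lambda_3((f_1 + f_2) 1_{A \times B},\ (g_1 + g_2) 1_{B \times C},\ (h_1 + h_2) 1_{C \times A}). \]

We can estimate this as the sum of a main term

\[ \Lambda_3(f_1 1_{A \times B},\ g_1 1_{B \times C},\ h_1 1_{C \times A}) \]

and three error terms

\[ O(\Lambda_3(|f_2| 1_{A \times B}, 1_{B \times C}, 1_{C \times A})) + O(\Lambda_3(1_{A \times B}, |g_2| 1_{B \times C}, 1_{C \times A})) + O(\Lambda_3(1_{A \times B}, 1_{B \times C}, |h_2| 1_{C \times A})). \]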
By definition of $E^0_{X,Y}, E^0_{Y,Z}, E^0_{Z,X}$, we have $f_1, g_1, h_1 \geq \tau^{1/10}$ on $A \times B$, $B \times C$, $C \times A$
respectively, and hence the main term is at least

\[ \tau^{3/10} \Lambda_3(1_{A \times B}, 1_{B \times C}, 1_{C \times A}). \]

On the other hand, we have by construction

\[ \mathbf{E}(f_2(x,y)^2 | A \times B) \leq \tau \]

and hence by Cauchy-Schwarz

\[ \Lambda_3(|f_2| 1_{A \times B}, 1_{B \times C}, 1_{C \times A}) \leq \tau^{1/2} \Lambda_3(1_{A \times B}, 1_{B \times C}, 1_{C \times A}). \]

Similarly for $g_2$ and $h_2$. Thus the error terms are $O(\tau^{2/10})$ times the main term. If $\tau \ll 1$
is chosen sufficiently small, we thus have the local estimate

\[ \Lambda_3(f_1 1_{A \times B},\ g_1 1_{B \times C},\ h_1 1_{C \times A}) = O(\Lambda_3((f_1 + f_2) 1_{A \times B},\ (g_1 + g_2) 1_{B \times C},\ (h_1 + h_2) 1_{C \times A})); \]

summing this over all $A, B, C$ and using (6.7) and the positivity of $f_1 + f_2$, $g_1 + g_2$, $h_1 + h_2$
we conclude that

\[ \Lambda_3(f_1 1_{E^0_{X,Y}},\ g_1 1_{E^0_{Y,Z}},\ h_1 1_{E^0_{Z,X}}) \leq O(\varepsilon) + O(1/F(M)) \]

(compare this with (6.8)). Since $f_1, g_1, h_1$ are bounded from below by $\tau^{1/10}$ on these
sets, we thus have

\[ \Lambda_3(1_{E^0_{X,Y}}, 1_{E^0_{Y,Z}}, 1_{E^0_{Z,X}}) \leq O(\tau^{-3/10} \varepsilon) + O(\tau^{-3/10}/F(M)). \]

Now let $E_{X,Y}$ be the subset of $E^0_{X,Y}$ defined as the union of all products $A \times B \subset E^0_{X,Y}$ of
atoms $A \in \mathcal{B}_X$, $B \in \mathcal{B}_Y$ of size $\mu_X(A), \mu_Y(B) \geq \tau/2^M$. Since $\mathcal{B}_X$ has complexity
at most $M$, the union of all atoms in $\mathcal{B}_X$ of measure at most $\tau/2^M$ has measure at most
$\tau$, and thus we see that

\[ \mu_X \times \mu_Y(E^0_{X,Y} \setminus E_{X,Y}) = O(\tau) \]

and hence from the preceding computations

\[ \int_X \int_Y f(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y) = o_{\tau \to 0}(1). \]

We define $E_{Y,Z}$, $E_{Z,X}$ similarly and observe similar bounds. Now suppose that the
expression $1_{E_{X,Y}}(x,y) 1_{E_{Y,Z}}(y,z) 1_{E_{Z,X}}(z,x)$ does not vanish identically. Then there exist
atoms $A, B, C$ of $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ with $A \times B \subset E_{X,Y}$, $B \times C \subset E_{Y,Z}$, and $C \times A \subset E_{Z,X}$.
In particular

\[ \Lambda_3(1_{A \times B}, 1_{B \times C}, 1_{C \times A}) \leq O(\tau^{-3/10} \varepsilon) + O(\tau^{-3/10}/F(M)). \]

On the other hand we have

\[ \Lambda_3(1_{A \times B}, 1_{B \times C}, 1_{C \times A}) = \mu_X(A) \mu_Y(B) \mu_Z(C) \geq (\tau/2^M)^3. \]

If we define $F(M) := \lfloor 2^{3M}/\tau^3 \rfloor + 1$, and assume that $\varepsilon$ is sufficiently small depending
on $\tau$ (noting that $M = O_{F,\tau}(1) = O_\tau(1)$), we obtain a contradiction. Thus we see
that $1_{E_{X,Y}}(x,y) 1_{E_{Y,Z}}(y,z) 1_{E_{Z,X}}(z,x)$ vanishes identically whenever $\varepsilon$ is sufficiently small
depending on $\tau$. If we then set $\tau$ to be a sufficiently slowly decaying function of $\varepsilon$, the
claim follows. □
Observe that the actual decay rate $o_{\varepsilon \to 0}(1)$ obtained by the above proof is very slow
(it decays like the reciprocal of the inverse tower-exponential function). It is of interest
to obtain better bounds here; it is not known what the exact rate should be, although
the Behrend example (Proposition 1.3) does show that the decay cannot be polynomial
in nature.
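
To give a feeling for how slow such a rate is, the following sketch computes the tower-exponential function and its inverse (often written $\log^*$). It is only an illustration of the growth rates involved: base 2 is an arbitrary choice here, and the code does not reproduce the constants that the proof above would yield.

```python
import math

def tower(height):
    """Tower-exponential: 2 ** 2 ** ... ** 2 with `height` twos (tower(0) = 1)."""
    value = 1
    for _ in range(height):
        value = 2 ** value
    return value

def log_star(x):
    """Inverse tower function: number of times log2 is applied before x drops to <= 1."""
    count = 0
    while x > 1:
        x = math.log2(x)
        count += 1
    return count

heights = [log_star(tower(h)) for h in range(5)]
```

Since `tower(4)` is already 65536 and `tower(5)` has nearly twenty thousand digits, a bound of the form $1/\log^*(1/\varepsilon)$ improves only when $1/\varepsilon$ climbs an entire level of the tower.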

The above arguments extend (with some nontrivial difficulty) to hypergraphs, and 
to proving Szemeredi's theorem for progressions of length k > 3; the k = 4 case was 
handled in |2j, [JU] (see also J20] for a more recent proof), and the general case in |33J, 
[3"4] . [32], |HI] and [21] (see also [12], [HI] for more recent proofs). We sketch the k = 4 
arguments here (broadly following the ideas from [12], |H3)- Finding progressions of 
length 4 in a set A is equivalent to solving the simultaneous relations 





\begin{align*}
 -x_2 - 2x_3 - 3x_4 &\in A \\
 x_1 - x_3 - 2x_4 &\in A \\
 2x_1 + x_2 - x_4 &\in A \\
 3x_1 + 2x_2 + x_3 &\in A.
\end{align*}
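
As a quick sanity check on these four linear forms, the sketch below verifies two facts by brute force over a small range of integers: the $i$-th form does not involve $x_i$, and the four values always form an arithmetic progression with common difference $x_1 + x_2 + x_3 + x_4$. The names `FORMS`, `evaluate` and `is_progression` are illustrative.

```python
from itertools import product

# Coefficient vectors (c1, c2, c3, c4) of the four linear forms in x1, x2, x3, x4.
FORMS = [
    (0, -1, -2, -3),
    (1, 0, -1, -2),
    (2, 1, 0, -1),
    (3, 2, 1, 0),
]

def evaluate(form, xs):
    """Value of the linear form c1*x1 + c2*x2 + c3*x3 + c4*x4."""
    return sum(c * x for c, x in zip(form, xs))

def is_progression(values):
    """True if all consecutive differences are equal."""
    steps = {b - a for a, b in zip(values, values[1:])}
    return len(steps) == 1

# The i-th form never involves x_i (zero coefficient on the diagonal) ...
assert all(FORMS[i][i] == 0 for i in range(4))
# ... and the four values always form a progression with step x1 + x2 + x3 + x4.
for xs in product(range(-3, 4), repeat=4):
    values = [evaluate(form, xs) for form in FORMS]
    assert is_progression(values)
    assert values[1] - values[0] == sum(xs)
```

The fact that each relation omits one variable is what lets each constraint be encoded as a function of three of the four variables, in the same way that the $k = 3$ case used functions of two variables.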



Because of this, it is not hard to modify the above arguments to deduce the $k = 4$ case
of Szemeredi's theorem from the following lemma:

Lemma 6.12 (Strong tetrahedron removal lemma, several variable version). Let $(X_1, \mu_{X_1}), \ldots, (X_4, \mu_{X_4})$
be probability spaces, and for $ijk = 123, 234, 341, 412$ let $f_{ijk} : X_i \times X_j \times X_k \to [0,1]$ be
measurable functions such that

\[ \Lambda_4(f_{123}, f_{234}, f_{341}, f_{412}) \leq \varepsilon \]

for some $0 < \varepsilon < 1$, where $\Lambda_4$ is the quadrilinear form

\[ \Lambda_4(f_{123}, f_{234}, f_{341}, f_{412}) := \int_{X_1} \cdots \int_{X_4} \prod_{ijk = 123, 234, 341, 412} f_{ijk}(x_i, x_j, x_k)\, d\mu_{X_1}(x_1) \ldots d\mu_{X_4}(x_4). \]

Then for each $ij = 12, 23, 34, 41, 13, 24$ there exist factors $\mathcal{B}_{ij}$ of $X_i \times X_j$ of complexity
at most $O_\varepsilon(1)$ and sets $E_{ijk} \in \mathcal{B}_{ij} \vee \mathcal{B}_{ik} \vee \mathcal{B}_{jk}$ for $ijk = 123, 234, 341, 412$ with
$\prod_{ijk = 123, 234, 341, 412} 1_{E_{ijk}}(x_i, x_j, x_k)$ vanishing identically, such that

\[ \int_{X_i} \int_{X_j} \int_{X_k} f_{ijk}(x_i, x_j, x_k) 1_{E_{ijk}^c}(x_i, x_j, x_k)\, d\mu_{X_i}(x_i) d\mu_{X_j}(x_j) d\mu_{X_k}(x_k) \leq o_{\varepsilon \to 0}(1). \]

One can recast this lemma as a statement concerning 3-uniform hypergraphs; see for 
instance [12]. We will however not pursue this interpretation here (but see 0, [TU], 
[3*3*] . [3*lj . |32j . [3*T] . |21j . and [2H] for a treatment of this material from a hypergraph 
perspective) . 

In the case of the triangle removal lemma, it was the $\Box^2$ norm which controlled the
size of $\Lambda_3$. Now the role is played by the $\Box^3$ norm, defined for a measurable bounded
function $f(x,y,z) : X \times Y \times Z \to \mathbb{R}$ of three variables by the formula

\begin{align*}
\|f\|_{\Box^3(X \times Y \times Z)}^8 := \int_X \int_X \int_Y \int_Y \int_Z \int_Z
  & f(x,y,z) f(x,y,z') f(x,y',z) f(x,y',z') \\
  & f(x',y,z) f(x',y,z') f(x',y',z) f(x',y',z')\,
    d\mu_X(x) d\mu_X(x') d\mu_Y(y) d\mu_Y(y') d\mu_Z(z) d\mu_Z(z').
\end{align*}

By modifying the previous arguments we see that the $\Box^3$ norm is indeed a norm (after
equating functions that agree almost everywhere) and that we have the generalised von
Neumann inequality

\[ |\Lambda_4(f, g, h, k)| \leq \min(\|f\|_{\Box^3}, \|g\|_{\Box^3}, \|h\|_{\Box^3}, \|k\|_{\Box^3}). \]
The analogue of Lemma 6.6 is

Lemma 6.13 (Lack of uniformity implies correlation with structure). Let $f : X \times Y \times
Z \to [-1,1]$ be such that $\|f\|_{\Box^3} \geq \eta$ for some $\eta > 0$. Then there exist $A_{X,Y} \subset X \times Y$,
$A_{Y,Z} \subset Y \times Z$, and $A_{Z,X} \subset Z \times X$ such that

\[ \Big| \int_X \int_Y \int_Z 1_{A_{X,Y}}(x,y) 1_{A_{Y,Z}}(y,z) 1_{A_{Z,X}}(z,x) f(x,y,z)\, d\mu_X(x) d\mu_Y(y) d\mu_Z(z) \Big| \geq \eta^8/8. \]

This ultimately leads to the following regularity lemma:

Lemma 6.14 (Simultaneous Szemeredi regularity lemma). For $ijk = 123, 234, 341, 412$,
let $f_{ijk} : X_i \times X_j \times X_k \to [0,1]$ be measurable, let $\tau > 0$, and let $F : \mathbb{N} \to \mathbb{N}$ be an
arbitrary increasing function (possibly depending on $\tau$). Then there exists an integer
$M = O_{F,\tau}(1)$, factors $\mathcal{B}_{ij}$ of $X_i \times X_j$ of complexity at most $M$ for $ij = 12, 23, 34, 41, 13, 24$,
and decompositions $f_{ijk} = f_{ijk,1} + f_{ijk,2} + f_{ijk,3}$ for $ijk = 123, 234, 341, 412$, where

• ($f_{ijk,1}$ is structured) We have $f_{ijk,1} = \mathbf{E}(f_{ijk} | \mathcal{B}_{ij} \vee \mathcal{B}_{jk} \vee \mathcal{B}_{ik})$.

• ($f_{ijk,2}$ is small) We have $\|f_{ijk,2}\|_{L^2(X_i \times X_j \times X_k)} \leq \tau$.

• ($f_{ijk,3}$ is very uniform) We have $\|f_{ijk,3}\|_{\Box^3(X_i \times X_j \times X_k)} \leq 1/F(M)$.

• (Positivity) $f_{ijk,1}$ and $f_{ijk,1} + f_{ijk,2}$ take values in $[0,1]$.

One would then like to repeat the proof of Lemma 6.11 by applying this lemma to
decompose each function $f_{ijk}$ into three components $f_{ijk,1}, f_{ijk,2}, f_{ijk,3}$, and then somehow
eliminate the latter two terms to reduce to the structured component $f_{ijk,1}$. The reason
for doing this is that, as $f_{ijk,1}$ is measurable with respect to the bounded complexity
factor $\mathcal{B}_{ij} \vee \mathcal{B}_{jk} \vee \mathcal{B}_{ik}$, one can decompose this function (which is a function of three variables
$x_i, x_j, x_k$) as a polynomial combination of functions of just two variables (or more
precisely, as a linear combination of functions of the form $f_{ij}(x_i, x_j) f_{jk}(x_j, x_k) f_{ik}(x_i, x_k)$).
One can then apply a (slight generalisation of) the triangle removal lemma to handle
such functions; more generally, the strategy is to deduce these sorts of removal lemmas
for functions of $k$ variables from similar lemmas concerning functions of $k - 1$ variables.
In executing this strategy, there is little difficulty in disposing of the very uniform
components $f_{ijk,3}$, if one takes advantage of the freedom to make the growth function $F$
extremely rapid (one needs to take $F$ to be tower-exponential or faster, to counteract
the very weak decay present in the two-variable removal lemmas). To dispose of the
small components $f_{ijk,2}$ takes a little more work, however. In the above arguments, one
implicitly used the independence of the underlying factors $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$. In the current
situation, the factors $\mathcal{B}_{ij}$ are not independent of each other, which makes it difficult
to eliminate the $f_{ijk,2}$ components directly. However, this can be addressed by applying the
(two-variable) regularity lemma to simultaneously regularise all the atoms in the
factors $\mathcal{B}_{ij}$, making them essentially independent relative to one-variable factors. As one
might imagine, making this strategy rigorous is somewhat delicate, and in particular
the various large and small parameters (such as $\tau$ and $F$) that appear in the regularity
lemmas need to be chosen correctly. See for instance [12] for one such realisation of this
type of argument. More recently, an infinitary approach, using a correspondence
principle similar in spirit to the Furstenberg correspondence principle, has been employed to
give a slightly different proof of the above results, in which the various large and small
parameters in the argument have been set to infinity or zero, thus leading to a cleaner
(but less elementary) version of the argument; see [45].



7. Relative triangle removal 

The triangle removal result proven in the previous section, Lemma 6.3, only has
non-trivial content when the underlying graph $G$ is dense, or more precisely when it
contains more than $o_{\varepsilon \to 0}(n^2)$ edges, since otherwise one could simply delete all the
edges in $G$ to remove the triangles. This is related to the fact that Lemma 6.3 only
implies the existence of progressions of length three in dense sets of integers, but not
in sparse sets. However, it is a remarkable and useful fact that results such as Lemma
6.3, which ostensibly only apply to dense objects, can in fact be extended "for free"
to sparse objects, as long as the sparse object has large relative density with respect 
to a sufficiently pseudorandom object. This type of "transference principle" from the 
dense category to the relatively dense category was the decisive new ingredient in the 
result in J2H] that the primes contained arbitrarily long arithmetic progressions. We will 
not prove that result here, however we present a simplified version of that result which 
already captures many of the key ideas. 

If $n > 1$ is an integer and $0 < p \leq 1$, let $G(n,p)$ be the standard Erdos-Renyi random
graph on $n$ vertices $\{1, \ldots, n\}$, in which each pair of vertices defines an edge in $G(n,p)$
independently and with the same probability $p$.
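
For readers who wish to experiment, here is a minimal sketch of sampling $G(n,p)$ and counting its triangles by brute force. The seed, the values of $n$ and $p$, and the function names are arbitrary illustrative choices; the brute-force count is cubic in $n$ and only suitable for small graphs.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng):
    """Sample G(n, p) on vertices 1..n as a set of edges (u, v) with u < v."""
    return {(u, v) for u, v in combinations(range(1, n + 1), 2) if rng.random() < p}

def count_triangles(n, edges):
    """Number of unordered triples {u, v, w} whose three pairs are all edges."""
    return sum(1 for u, v, w in combinations(range(1, n + 1), 3)
               if (u, v) in edges and (v, w) in edges and (u, w) in edges)

rng = random.Random(2024)       # fixed seed so that the run is reproducible
n, p = 60, 0.5
H = erdos_renyi(n, p, rng)
triangles = count_triangles(n, H)
expected = p**3 * n * (n - 1) * (n - 2) / 6   # expected number of triangles
```

A typical sample has about $p^3 n^3/6$ triangles, which is the natural scale against which a hypothesis such as "fewer than $\varepsilon p^3 n^3$ triangles" should be read.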

Proposition 7.1 (Relative triangle removal lemma). |2H1> |M! Let n > 1 and 1/ logn < 
$p \leq 1$, let $0 < \varepsilon < 1$, and let $H = G(n,p)$. Then with probability $1 - o_{n \to \infty; \varepsilon}(1)$ the
following claim is true: whenever $G$ is a subgraph of $H$ which contains fewer than $\varepsilon p^3 n^3$
triangles, then it is possible to delete $o_{\varepsilon \to 0}(p^2 n^2) + o_{n \to \infty; \varepsilon}(p^2 n^2)$ edges from $G$ to create
a new graph $G'$ which contains no triangles whatsoever.

This result in fact extends to much sparser graphs $G(n,p)$; indeed one can take
$p = n^{-1/2+\delta}$ for any fixed $0 < \delta < 1/2$; see [28]. The argument there proceeds by a
careful generalisation of the usual regularity lemma to the setting of sparse subsets of
pseudorandom graphs. As one corollary of that result, one can conclude that if $A$ is a
random subset of the positive integers with $\mathbf{P}(n \in A) = n^{-1/2+\delta}$, and with the events
$n \in A$ being independent, then almost surely every subset of $A$ of positive density would
contain infinitely many progressions of length three. We shall proceed differently, using
a "soft" transference argument, inspired by the ergodic theory approach, which follows 
closely the treatment in J2S] (and also 0H]). So far, this argument can only handle 
logarithmic sparsities rather than polynomial, but requires much less randomness on 
the graph G{n,p); indeed a suitably "pseudorandom" graph would also suffice for this 
argument. (For the precise definition of the pseudorandomness needed, see |43j . ) 

Let $(X, \mu_X) = (Y, \mu_Y) = (Z, \mu_Z)$ be the vertex set $\{1, \ldots, n\}$ with the uniform
distribution. Fix the random graph $H = G(n,p)$, and let $\nu(x,y)$ be the function on
$\{1, \ldots, n\} \times \{1, \ldots, n\}$ which equals $1/p$ when $(x,y)$ lies in $H$ and $0$ otherwise; we can
think of $\nu$ as a function on $X \times Y$, $Y \times Z$, or $Z \times X$. Note from Chernoff's inequality
that even though $\nu$ is not bounded by $O(1)$, with probability $1 - o_{n \to \infty}(1)$, $\nu$ has average
close to 1:

\[ \int_X \int_Y \nu(x,y)\, d\mu_X(x) d\mu_Y(y) = 1 + o_{n \to \infty}(1). \]
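
The following sketch builds the weight $\nu$ for one sample of $G(n,p)$ and computes its average. It is only a numerical illustration under arbitrary choices of $n$, $p$ and seed (and it labels the vertices $0, \ldots, n-1$ rather than $1, \ldots, n$); it does not verify any of the $o(1)$ rates claimed in the text.

```python
import random
from itertools import combinations

def sample_nu(n, p, rng):
    """Return nu as an n x n table: 1/p on edges of a sampled G(n, p), 0 elsewhere."""
    nu = [[0.0] * n for _ in range(n)]
    for x, y in combinations(range(n), 2):
        if rng.random() < p:
            nu[x][y] = nu[y][x] = 1.0 / p    # H is undirected, so nu is symmetric
    return nu

def mean(nu, n):
    """Average of nu(x, y) over all pairs (x, y), i.e. the double integral in the text."""
    return sum(map(sum, nu)) / n**2

rng = random.Random(7)          # fixed seed, purely for reproducibility
n, p = 300, 0.2
nu = sample_nu(n, p, rng)
average = mean(nu, n)
largest = max(max(row) for row in nu)
```

In such a run `largest` equals $1/p$, well above 1, while `average` stays close to 1: the weight is unbounded pointwise but normalised on average, which is exactly the situation the transference argument is designed for.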
More sophisticated computations of this sort show that many other correlations of $\nu$
with itself are close to 1. For instance, one can show that with probability $1 - o_{n \to \infty}(1)$
we have the octahedral correlation estimate

\begin{align*}
\int_X \int_X \int_Y \int_Y \int_Z \int_Z
  & \nu(x,y) \nu(x,y') \nu(x',y) \nu(x',y') \\
  & \nu(y,z) \nu(y,z') \nu(y',z) \nu(y',z') \\
  & \nu(z,x) \nu(z,x') \nu(z',x) \nu(z',x') \\
  & d\mu_X(x) d\mu_X(x') d\mu_Y(y) d\mu_Y(y') d\mu_Z(z) d\mu_Z(z') = 1 + o_{n \to \infty}(1). \tag{7.1}
\end{align*}

(In [23] this estimate, together with some simpler versions, is referred to as the linear
forms condition on $\nu$.) To prove Proposition 7.1 it then suffices to prove the following
variant of Lemma 6.11:

Lemma 7.2 (Relative strong triangle removal lemma, several variable version). Let
$(X, \mu_X)$, $(Y, \mu_Y)$, $(Z, \mu_Z)$, $\nu$ be as above, and let $0 < \varepsilon < 1$. With probability
$1 - o_{n \to \infty; \varepsilon}(1)$, the following claim is true: whenever $f : X \times Y \to [0,1]$, $g : Y \times Z \to
[0,1]$, and $h : Z \times X \to [0,1]$ are measurable functions such that $\Lambda_3(f\nu, g\nu, h\nu) \leq \varepsilon$,
then there exist factors $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ of $X, Y, Z$ respectively of complexity at most $O_\varepsilon(1)$
and sets $E_{X,Y} \in \mathcal{B}_X \vee \mathcal{B}_Y$, $E_{Y,Z} \in \mathcal{B}_Y \vee \mathcal{B}_Z$, $E_{Z,X} \in \mathcal{B}_Z \vee \mathcal{B}_X$ respectively with
$1_{E_{X,Y}}(x,y) 1_{E_{Y,Z}}(y,z) 1_{E_{Z,X}}(z,x)$ vanishing identically, such that

\[ \int_X \int_Y f(x,y) \nu(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y), \]
\[ \int_Y \int_Z g(y,z) \nu(y,z) 1_{E_{Y,Z}^c}(y,z)\, d\mu_Y(y) d\mu_Z(z), \]
\[ \int_Z \int_X h(z,x) \nu(z,x) 1_{E_{Z,X}^c}(z,x)\, d\mu_Z(z) d\mu_X(x) \leq o_{\varepsilon \to 0}(1). \]

We leave the deduction of Proposition 7.1 from Lemma 7.2 as an exercise. Note that
the only new feature here is the presence of the weight $\nu$, which causes functions such
as $f\nu$ to be unbounded. Nevertheless, it turns out to be possible to use arguments
similar to those in the preceding section and obtain this result with a little effort from
its unweighted counterpart, Lemma 6.11.

The first thing to do is to check that the generalised von Neumann inequality, (6.6),
continues to hold in the weighted setting:

Lemma 7.3 (Relative generalised von Neumann inequality). |13j Let the notation be 
as above. Then with probability $1 - o_{n \to \infty}(1)$, the following claim is true: whenever
$f : X \times Y \to \mathbb{R}$, $g : Y \times Z \to \mathbb{R}$ and $h : Z \times X \to \mathbb{R}$ are bounded in magnitude by $\nu + 1$
(thus for instance $|f(x,y)| \leq \nu(x,y) + 1$ for all $(x,y) \in X \times Y$), then

\[ |\Lambda_3(f, g, h)| \leq 4 \min(\|f\|_{\Box^2}, \|g\|_{\Box^2}, \|h\|_{\Box^2}) + o_{n \to \infty}(1). \]



See also [25] for a closely related computation. We also remark that the estimate
(6.5) continues to hold in this setting, because that estimate did not require $f$ to
be bounded.

Proof. (Sketch only) By symmetry it suffices to show that

\[ \Big| \int_X \int_Y \int_Z f(x,y) g(y,z) h(z,x)\, d\mu_X(x) d\mu_Y(y) d\mu_Z(z) \Big| \leq 4 \|f\|_{\Box^2} + o_{n \to \infty}(1). \]

Note that it is easy to verify that $\|\nu + 1\|_{\Box^2} = 2 + o_{n \to \infty}(1)$ with high probability,
and hence $\|f\|_{\Box^2} = O(1)$. We eliminate the $h$ function by Cauchy-Schwarz in the $z, x$
variables and reduce to showing

\[ \Big| \int_X \int_Y \int_Y \int_Z f(x,y) f(x,y') g(y,z) g(y',z) (\nu(z,x) + 1)\, d\mu_X(x) d\mu_Y(y) d\mu_Y(y') d\mu_Z(z) \Big|
   \leq 8 \|f\|_{\Box^2}^2 + o_{n \to \infty}(1) \]

and then eliminate $g$ by Cauchy-Schwarz in the $y, y', z$ variables and reduce to showing

\[ \Big| \int_X \int_X \int_Y \int_Y f(x,y) f(x,y') f(x',y) f(x',y') W(x,x',y,y')\, d\mu_X(x) d\mu_X(x') d\mu_Y(y) d\mu_Y(y') \Big|
   \leq 16 \|f\|_{\Box^2}^4 + o_{n \to \infty}(1) \]

where

\[ W(x,x',y,y') := \int_Z (\nu(y,z) + 1)(\nu(y',z) + 1)(\nu(z,x) + 1)(\nu(z,x') + 1)\, d\mu_Z(z). \]

If $W = 16$ then we would be done by definition of the $\Box^2$ norm. So it suffices to show
that

\[ \Big| \int_X \int_X \int_Y \int_Y f(x,y) f(x,y') f(x',y) f(x',y') |W(x,x',y,y') - 16|\, d\mu_X(x) d\mu_X(x') d\mu_Y(y) d\mu_Y(y') \Big| \leq o_{n \to \infty}(1). \]

By one last Cauchy-Schwarz this follows from the estimate

\[ \Big| \int_X \int_X \int_Y \int_Y (\nu(x,y) + 1)(\nu(x,y') + 1)(\nu(x',y) + 1)(\nu(x',y') + 1) |W(x,x',y,y') - 16|^2\, d\mu_X(x) d\mu_X(x') d\mu_Y(y) d\mu_Y(y') \Big| \leq o_{n \to \infty}(1) \]

which can be easily verified from correlation estimates such as (7.1). □

In light of this lemma, we can continue to treat errors which are small in $\Box^2$ norm
as being negligible. The key to establishing Lemma 7.2 now rests with the following
decomposition:

Theorem 7.4 (Structure theorem). (3^3 Let the notation be as above, let f : X x Y — ► 
$[0,1]$ be a function, and let $\sigma > 0$. Then there exists a decomposition

\[ f\nu = f_1 + f_2 + f_3 \]

where $f_1$ is non-negative and obeys the uniform upper bound

\[ f_1(x,y) \leq 1 \text{ for all } (x,y) \in X \times Y, \]

$f_2$ is non-negative and obeys the smallness bound

\[ \int_X \int_Y f_2(x,y)\, d\mu_X(x) d\mu_Y(y) = o_{n \to \infty; \sigma}(1), \tag{7.2} \]

and $f_3$ obeys the uniformity estimate

\[ \|f_3\|_{\Box^2(X \times Y)} = o_{\sigma \to 0}(1). \tag{7.3} \]

Furthermore $f_1 + f_3$ is also non-negative.
This theorem should be compared with Lemma 6.9. The key point is that it approximates
the function $f\nu$, for which we have no good uniform bounds, by the function $f_1$,
which is bounded by 1. With this theorem (and Lemma 7.3) it is now a simple matter
to deduce Lemma 7.2 from Lemma 6.11.

Proof of Lemma 7.2. We may assume that $n$ is sufficiently large depending on $\varepsilon$, as the
claim is trivial otherwise. Let $0 < \sigma < \varepsilon$ be chosen later. We apply Theorem 7.4 to
decompose $f\nu = f_1 + f_2 + f_3$, $g\nu = g_1 + g_2 + g_3$, $h\nu = h_1 + h_2 + h_3$, thus

\[ \Lambda_3(f_1 + f_2 + f_3,\ g_1 + g_2 + g_3,\ h_1 + h_2 + h_3) \leq \varepsilon. \]

Since $f_1 + f_3$, $g_1 + g_3$, $h_1 + h_3$, $f_2$, $g_2$, $h_2$ are all non-negative, we conclude

\[ \Lambda_3(f_1 + f_3,\ g_1 + g_3,\ h_1 + h_3) \leq \varepsilon. \]

Repeated application of Lemma 7.3 and (7.3) (and the hypothesis $\sigma < \varepsilon$) then gives

\[ \Lambda_3(f_1, g_1, h_1) \leq o_{\varepsilon \to 0}(1). \]

The functions $f_1, g_1, h_1$ are bounded, so we may apply Lemma 6.11 and obtain factors
$\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ of $X, Y, Z$ respectively of complexity at most $O_\varepsilon(1)$ and sets $E_{X,Y} \in \mathcal{B}_X \vee \mathcal{B}_Y$,
$E_{Y,Z} \in \mathcal{B}_Y \vee \mathcal{B}_Z$, $E_{Z,X} \in \mathcal{B}_Z \vee \mathcal{B}_X$ respectively with $1_{E_{X,Y}}(x,y) 1_{E_{Y,Z}}(y,z) 1_{E_{Z,X}}(z,x)$ vanishing
identically, such that

\[ \int_X \int_Y f_1(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y), \]
\[ \int_Y \int_Z g_1(y,z) 1_{E_{Y,Z}^c}(y,z)\, d\mu_Y(y) d\mu_Z(z), \]
\[ \int_Z \int_X h_1(z,x) 1_{E_{Z,X}^c}(z,x)\, d\mu_Z(z) d\mu_X(x) \leq o_{\varepsilon \to 0}(1). \]

From (7.2) we have similar estimates for $f_2, g_2, h_2$:

\[ \int_X \int_Y f_2(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y), \]
\[ \int_Y \int_Z g_2(y,z) 1_{E_{Y,Z}^c}(y,z)\, d\mu_Y(y) d\mu_Z(z), \]
\[ \int_Z \int_X h_2(z,x) 1_{E_{Z,X}^c}(z,x)\, d\mu_Z(z) d\mu_X(x) \leq o_{n \to \infty; \sigma}(1). \]

Also, from (7.3), (6.5) and the complexity bounds on $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ we have similar
estimates for $f_3, g_3, h_3$:

\[ \int_X \int_Y f_3(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y), \]
\[ \int_Y \int_Z g_3(y,z) 1_{E_{Y,Z}^c}(y,z)\, d\mu_Y(y) d\mu_Z(z), \]
\[ \int_Z \int_X h_3(z,x) 1_{E_{Z,X}^c}(z,x)\, d\mu_Z(z) d\mu_X(x) \leq o_{\sigma \to 0; \varepsilon}(1). \]

If we choose $\sigma$ sufficiently small depending on $\varepsilon$, we thus have (summing the three
groups of estimates and recalling that $f\nu = f_1 + f_2 + f_3$, etc.)

\[ \int_X \int_Y f(x,y) \nu(x,y) 1_{E_{X,Y}^c}(x,y)\, d\mu_X(x) d\mu_Y(y), \]
\[ \int_Y \int_Z g(y,z) \nu(y,z) 1_{E_{Y,Z}^c}(y,z)\, d\mu_Y(y) d\mu_Z(z), \]
\[ \int_Z \int_X h(z,x) \nu(z,x) 1_{E_{Z,X}^c}(z,x)\, d\mu_Z(z) d\mu_X(x) \leq o_{\varepsilon \to 0}(1) + o_{n \to \infty; \varepsilon}(1) \]

and the claim follows. □

Notice how the complexity estimates on $\mathcal{B}_X, \mathcal{B}_Y, \mathcal{B}_Z$ were essential in allowing one to
transfer the unweighted triangle removal lemma, Lemma 6.11, to the weighted setting,
Lemma 7.2.

It remains to prove the structure theorem, Theorem 7.4. A full proof (in much greater
generality) of this theorem can be found in [33] , while a closely related theorem appears 
in J2S] • We give only a brief summary of the argument here. Broadly speaking, we follow 
the energy increment strategy as used to prove Corollary 6.8. However, we cannot use
Lemma 6.6 as it only applies to functions $f$ which are bounded. We must therefore
redefine the notion of "structure", replacing the notion of a tensor product $1_A(x) 1_B(y)$
with the notion of a dual function $\mathcal{D}f(x,y)$ of a function $f : X \times Y \to \mathbb{R}$, defined as

\[ \mathcal{D}f(x,y) := \int_X \int_Y f(x,y') f(x',y) f(x',y')\, d\mu_X(x') d\mu_Y(y'). \]

Observe that we have the identity

\[ \int_X \int_Y f(x,y) \mathcal{D}f(x,y)\, d\mu_X(x) d\mu_Y(y) = \|f\|_{\Box^2(X \times Y)}^4. \]

Thus if a function $f$ has large $\Box^2$ norm then it correlates with its own dual function. This
fact will be used as a substitute for Lemma 6.6. One key property of dual functions
is that they can be bounded even when $f$ is unbounded; in particular, with high
probability we have $\mathcal{D}(\nu + 1)$ bounded pointwise by $O(1)$, and hence $\mathcal{D}f$ will also be
bounded for any $f$ bounded pointwise in magnitude by $\nu + 1$. Each of these dual
functions can define finite factors $\mathcal{B}_{\mathcal{D}f,\varepsilon}$ for any resolution $\varepsilon > 0$ by partitioning the
range of $\mathcal{D}f$ into intervals of length $\varepsilon$ and letting $\mathcal{B}_{\mathcal{D}f,\varepsilon}$ be the factor generated by the
inverse image of these intervals. (For technical reasons it is convenient to randomly
shift this partition in order to negate certain boundary effects - which ultimately lead
to the small error $f_2$ appearing in Theorem 7.4 - but let us gloss over this minor detail
here.) Define a dual factor of complexity $M$ and resolution $\varepsilon$ to be a factor of the form
$\mathcal{B} = \mathcal{B}_{\mathcal{D}f_1,\varepsilon} \vee \ldots \vee \mathcal{B}_{\mathcal{D}f_M,\varepsilon}$ where $f_1, \ldots, f_M$ are bounded in magnitude by $\nu + 1$. These
factors are the counterparts of the factors $\mathcal{B}_X \vee \mathcal{B}_Y$ studied in the previous section.
A crucial feature of these factors is that (with high probability) the random weight
function $\nu$ is uniformly distributed with respect to all these factors; more precisely, with
probability $1 - o_{n \to \infty; \varepsilon, M}(1)$ we have $\mathbf{E}(\nu|\mathcal{B}) = 1 + o_{n \to \infty; M, \varepsilon}(1)$ outside of an exceptional
set $\Omega = \Omega_{\mathcal{B}}$ with $\int_X \int_Y 1_\Omega(x,y) (\nu(x,y) + 1)\, d\mu_X d\mu_Y = o_{n \to \infty; M, \varepsilon}(1)$, for all dual factors
of complexity $M$. This fact is somewhat nontrivial to prove; one needs to invoke the
Weierstrass approximation theorem to approximate the indicator function of atoms
in $\mathcal{B}$ by polynomial combinations of the dual functions $\mathcal{D}f$ (with the approximation
being uniform outside of a small exceptional set $\Omega$), and then using tools such as the
Gowers-Cauchy-Schwarz inequality one can control the inner product of $\nu$ with such
polynomials. See [33], [2S] for details. 
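
The dual function and the factor it generates can be made concrete on a finite example. The sketch below assumes $X = Y = \{0, \ldots, n-1\}$ with the uniform measure and uses exact rational arithmetic; it checks the identity $\int\!\!\int f\, \mathcal{D}f = \|f\|_{\Box^2}^4$ and then forms the atoms of the factor generated by $\mathcal{D}f$ at resolution $\varepsilon$ by grouping points according to $\lfloor \mathcal{D}f/\varepsilon \rfloor$. The random shift of the partition mentioned in the text is omitted, and the test function and helper names are arbitrary.

```python
from fractions import Fraction
from itertools import product

def dual(f, n):
    """Dual function Df(x, y): average of f(x,y') f(x',y) f(x',y') over x', y'."""
    return [[sum(f[x][y2] * f[x2][y] * f[x2][y2]
                 for x2, y2 in product(range(n), repeat=2)) / Fraction(n * n)
             for y in range(n)] for x in range(n)]

def box2_fourth_power(f, n):
    """Fourth power of the box norm: average of f(x,y) f(x,y') f(x',y) f(x',y')."""
    return sum(f[x][y] * f[x][y2] * f[x2][y] * f[x2][y2]
               for x, x2, y, y2 in product(range(n), repeat=4)) / Fraction(n ** 4)

def level_set_atoms(values, n, eps):
    """Atoms of the factor generated by Df: group points by floor(Df / eps)."""
    atoms = {}
    for x, y in product(range(n), repeat=2):
        atoms.setdefault(values[x][y] // eps, []).append((x, y))
    return atoms

n = 4
# Arbitrary (hypothetical) test data with exact rational values.
f = [[Fraction((3 * x + 5 * y) % 7, 2) for y in range(n)] for x in range(n)]
Df = dual(f, n)
correlation = sum(f[x][y] * Df[x][y] for x, y in product(range(n), repeat=2)) / Fraction(n * n)
atoms = level_set_atoms(Df, n, Fraction(1, 2))
```

The number of atoms is controlled by the range of $\mathcal{D}f$ divided by $\varepsilon$, which is why pointwise boundedness of dual functions is what keeps these factors finite even when $f$ itself is unbounded.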

Once one has these dual factors with respect to which $\nu$ is (essentially) uniformly
distributed, one can then develop a counterpart of Lemma 6.7 which roughly speaking
asserts that if $f$ is a function bounded in magnitude by $\nu$, and $\mathcal{B}$ is a dual factor of
some complexity $M$ and resolution $\varepsilon$ for which $\|f - \mathbf{E}(f|\mathcal{B})\|_{\Box^2} \geq \eta$, then with high
probability one can find an extension $\mathcal{B}'$ of $\mathcal{B}$ which is a dual factor of complexity $M + 1$
and resolution $\varepsilon$, for which the energy $\|\mathbf{E}(f|\mathcal{B}')\|_{L^2}^2$ has increased from $\|\mathbf{E}(f|\mathcal{B})\|_{L^2}^2$ by
at least $c(\eta) - o_{\varepsilon \to 0}(1) - o_{n \to \infty; M, \varepsilon}(1)$ for some $c(\eta) > 0$. This is essentially proven
by the same Pythagoras theorem argument used to establish Lemma 6.7, though one
has to take some care because $f$, being bounded by $\nu$, does not enjoy good $L^2$ bounds
(though the conditional expectations $\mathbf{E}(f|\mathcal{B})$, $\mathbf{E}(f|\mathcal{B}')$ enjoy uniform bounds outside of
a small exceptional set). One can then iterate this as in the proof of Corollary 6.8 to
obtain Theorem 7.4 (with some additional $o_{n \to \infty; \varepsilon}(1)$ errors arising from exceptional sets
etc. that can be placed in the small error $f_2$). See jlHj, |2S] for details.

8. Szemeredi's original proof

We now discuss some of the ideas behind Szemeredi's original proof [HU of his theorem. 
This is a remarkably subtle combinatorial argument, and there is no chance that we 
can describe the full argument here, but we can at least begin to motivate part of the 
argument. Rather than plunge directly into the full setup of the argument, we will 
begin with some naive first attempts at the problem, which do not fully work, but 
which indicate the steps that need to be taken to obtain a full proof. 

The task is, given $k \geq 3$, to show that any subset $A$ of integers whose upper density
$\delta = \delta[A] := \limsup_{N \to \infty} \frac{|A \cap [-N,N]|}{2N+1}$ is positive contains at least one progression of length
$k$. The first idea dates back to the original argument of Roth jSS] for the $k = 3$ case,
which is to try to induct downwards on the upper density of the set (this is known as
the density increment method). If $\delta$ is extremely large, say $\delta > 1 - 1/2k$, then the result
is easy, because even a randomly chosen progression will have a good chance of being
entirely contained in $A$. Now one assumes inductively that $A$ has some given upper
density $\delta > 0$, and that the theorem has already been proven for higher values of $\delta$. It
is not hard to show that the set of $\delta$ for which Szemeredi's theorem holds must be open,
so if we can verify in this "maximal bad density" case$^{12}$ that progressions of length $k$
exist, then we are done.

Suppose for contradiction that the set $A$ of this critical density $\delta$ did not have any
progressions of length $k$, even though all sets of higher density did have progressions.
What this means is that $A$ cannot contain within it arbitrarily large progressions on
which $A$ has higher density. In other words, we cannot find a sequence of progressions
$P_1, P_2, \ldots$ in $\mathbb{Z}$ with length tending to infinity for which $\limsup_{n \to \infty} |A \cap P_n|/|P_n| > \delta$,
since if this were the case it would not be difficult to piece together out of the $A \cap P_n$
a set with slightly higher upper density than $A$, but which still had no progressions,
contradicting the hypothesis on $\delta$. Thus we must have $\limsup_{n \to \infty} |A \cap P_n|/|P_n| \leq \delta$
whenever $|P_n| \to \infty$. In other words, we have the upper bound

\[ |A \cap P| \leq (\delta + o_{|P| \to \infty; A}(1)) |P| \tag{8.1} \]

for all progressions $P$. [Incidentally, if we knew Szemeredi's theorem in the first place,
one would deduce immediately that the only such sets $A$ are those sets with density
$\delta = 0$ or density $\delta = 1$, but of course we cannot use Szemeredi's theorem to prove itself
in such a circular manner!]

$^{12}$ This trick is vaguely reminiscent of the reduction to minimal topological dynamical systems, or to
ergodic measure-preserving systems. Unfortunately these tricks seem to be mutually exclusive; if one
takes sequences of maximal density then it becomes difficult to convert the argument into a dynamical
setting.

Thus on a long progression $P$, the density of $A$ cannot significantly exceed $\delta$. It is
still possible for the density of $A$ to be significantly less than $\delta$ on such progressions -
but this cannot happen too often, as this would (in conjunction with the upper bound)
eventually cause $A$ itself to have density less than $\delta$. This idea can be easily quantified,
and leads to the statement that given any length $N$, the set

\[ \{ n \in \mathbb{Z} : |A \cap [n, n+N)| = (\delta + o_{N \to \infty; A}(1)) N \} \]

has upper density $1 - o_{N \to \infty; A}(1)$. Thus "most" progressions of length $N$ have density
$\delta + o_{N \to \infty; A}(1)$.
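
The quantity being discussed can be computed directly for a finite window of integers. The sketch below is only a toy illustration: it takes $A$ to be the multiples of 4 (density $1/4$), where every window is trivially saturated, and the names and the tolerance parameter are assumptions made for the example rather than anything from the argument itself.

```python
def window_density(A, n, N):
    """Density of A in the interval [n, n + N)."""
    return sum(1 for m in range(n, n + N) if m in A) / N

def fraction_of_saturated_windows(A, N, delta, tolerance, starts):
    """Fraction of start points n whose window density is within `tolerance` of delta."""
    good = sum(1 for n in starts if abs(window_density(A, n, N) - delta) <= tolerance)
    return good / len(starts)

# Toy example: multiples of 4 have density 1/4, and every window of length 40 sees this.
A = {m for m in range(0, 2000) if m % 4 == 0}
share = fraction_of_saturated_windows(A, N=40, delta=0.25, tolerance=0.05,
                                      starts=range(0, 1900))
```

For a less regular set one would see `share` strictly below 1, and the point of the discussion that follows is that this shortfall, although small, is not small enough compared with the number of colours.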

This then leads to the next idea, which is to partition the integers into blocks
$[nN, (n+1)N)$ - progressions of length $N$ whose starting point is a multiple of $N$. Call such a block
saturated if it has the expected density $\delta + o_{N \to \infty; A}(1)$; thus most blocks (in an upper
density sense) are saturated. Suppose temporarily that we could in fact assume that
all blocks are saturated. Then we could conclude the argument as follows. We can
colour the $n$-th block $[nN, (n+1)N)$ in one of $2^N$ colours depending on how $A$ is situated
inside that block; more precisely, we can colour the block $[nN, (n+1)N)$ by the set
$\{0 \leq i < N : nN + i \in A\}$. Actually we only need $2^N - 1$ colours because the block,
being saturated, cannot be completely devoid of elements of $A$. We have thus coloured
all the integers into finitely many colours, and hence by van der Waerden's theorem there
is a monochromatic progression of blocks of length $k$. These blocks have $A$ contained in
them in identical fashions, and the blocks are not completely devoid of elements of $A$,
so it is not hard to see that the progression of blocks induces a progression of elements
of $A$ of the same length, and we are done.
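
The colouring step can be illustrated as follows. The sketch colours each block by the pattern of $A$ inside it and then searches by brute force for $k$ equally spaced blocks with the same non-empty colour; this search merely stands in for van der Waerden's theorem, which guarantees that such blocks exist but is not itself implemented here. The set $A$, the block length and the function names are hypothetical choices.

```python
def block_colour(A, n, N):
    """Colour of the block [n*N, (n+1)*N): the set of offsets i with n*N + i in A."""
    return frozenset(i for i in range(N) if n * N + i in A)

def monochromatic_block_progression(A, N, num_blocks, k):
    """Search block indices n, n+d, ..., n+(k-1)d (d >= 1) sharing one non-empty colour."""
    colours = [block_colour(A, n, N) for n in range(num_blocks)]
    for d in range(1, num_blocks):
        for n in range(num_blocks - (k - 1) * d):
            c = colours[n]
            if c and all(colours[n + j * d] == c for j in range(k)):
                return n, d, c
    return None

# Toy example: A = integers congruent to 1 or 2 mod 5, block length N = 4.
A = {a for a in range(200) if a % 5 in (1, 2)}
N, k = 4, 3
n, d, colour = monochromatic_block_progression(A, N, 200 // N, k)
i = min(colour)                                   # any offset in the common colour works
progression = [(n + j * d) * N + i for j in range(k)]
```

Because the $k$ blocks carry the same pattern, picking the same offset in each of them produces a genuine progression inside $A$, with common difference $dN$.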

Unfortunately, life is not so simple, and we have the unsaturated blocks to deal with.
While the (lower) density of these exceptional blocks is somewhat small in an absolute
sense - it is $o_{N \to \infty; A}(1)$ - it is not very small when compared against the number of
colours, $2^N - 1$ (or against the reciprocal of this number, to be precise). Van der
Waerden's theorem is nowhere near robust enough to handle such a severe influx of
"uncoloured" elements. (It can however deal with a rather easy degenerate case in
which the density of saturated blocks unexpectedly happens to be incredibly close to 1,
say at least $1 - c(N)$ for some explicit but extremely small $c(N) > 0$ whose exact value
depends on the constants arising from van der Waerden's theorem.) Here we encounter
a recurring problem in this field: we are always dealing with quantities which are small,
but not small enough. One is always seeking ways to somehow iteratively improve the
smallness, or at least convert the smallness to another type of smallness which is more
robust, in order to get around this basic issue.

Let's try something else for now. Suppose we can locate k large blocks of integers,
say $[0, N), [N, 2N), \ldots, [(k-1)N, kN)$, which are all saturated. (This is not hard since
the upper density of saturated blocks easily exceeds $1 - \frac{1}{2k}$ when N is large enough.)
Let's try to find progressions of length k in A with one element in each block. Suppose
we have somehow (presumably by some sort of inductive hypothesis) already managed to
find many progressions of length $k-1$ in A with one element in each of the first
$k-1$ of these blocks. We can extend each of these progressions by one element, which
will most likely lie in the final block $[(k-1)N, kN)$. (Some of them will not. However,
observe that A has to be more or less uniformly distributed on any saturated block,
because on any sub-interval of proportional size, A has to have density not much larger
than $\delta$, and thus on subtraction it must have density not much less than $\delta$ either. Because
of this it is very plausible that a significant fraction of the progressions of length $k-1$ located
from the induction step will have their $k$-th element in the final block as claimed.) Let B
denote the set of all such additional elements of these progressions in $[(k-1)N, kN)$. If
we had a lot of progressions of length $k-1$, it is plausible to expect (by simple counting
heuristics) that B should have some positive density in $[(k-1)N, kN)$ (indeed, one
expects the density to be comparable to $\delta^{k-1}$). If B intersects A, then we are done.
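
The construction of B just described can be tried out on toy data. The following is only a brute-force sketch, not part of the argument: the function name `extend_progressions` and the two example sets are hypothetical choices made for illustration, and no saturation hypothesis is checked. It collects the k-th terms of progressions whose first k - 1 terms lie in A, one per block, and then tests whether B meets A.

```python
def extend_progressions(A, N, k):
    """Return B: the k-th terms, lying in the final block [(k-1)N, kN), of
    k-term progressions whose first k-1 terms lie in A, with the j-th term
    in the j-th block [(j-1)N, jN). Brute force, for small N only."""
    A = set(A)
    B = set()
    first_block = [x for x in A if 0 <= x < N]
    for x in first_block:
        # Consecutive terms sit in adjacent blocks, so the step is below 2N.
        for d in range(1, 2 * N):
            terms = [x + j * d for j in range(k)]
            one_per_block = all(j * N <= t < (j + 1) * N
                                for j, t in enumerate(terms))
            if one_per_block and all(t in A for t in terms[:-1]):
                B.add(terms[-1])
    return B


N, k = 30, 3
# Toy set 1: multiples of 3 in every block, so every extension lands in A.
A_good = {n for n in range(k * N) if n % 3 == 0}
# Toy set 2: same first two blocks, but the final block uses the class 1 mod 3.
A_bad = {n for n in range(k * N) if n % 3 == (0 if n < 2 * N else 1)}

B = extend_progressions(A_good, N, k)
hits_good = B & A_good
hits_bad = extend_progressions(A_bad, N, k) & A_bad
```

In the second toy set the final block has the same density as the others, yet B misses A entirely there, which is the kind of obstruction the discussion that follows has to overcome.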

Unfortunately, B and A are both rather sparse sets inside $[(k-1)N, kN)$ - one has
density about $\delta^{k-1}$ (assuming some appropriate induction hypothesis), and the other has
density about $\delta$. These are too sparse to force an intersection unconditionally. However,
we do know that A obeys some good uniform distribution bounds on progressions - its 
density is always bounded from above, and often bounded from below. This would be 
useful if B was somehow made out of progressions (or even better, if the complement 
of B was made out of progressions, since upper bounds on the density of A in the 
complement of B translate to lower bounds on the density of A in B), but we do not
have such good structural control on B and it could well be just a generic sparse subset 
of [(k — 1)N, kN), and we are stuck. Indeed, there is nothing right now that stops B 
from simply being some subset of the complement of A, and no matter how structured 
or uniformly distributed A is, we cannot prevent such an event from happening. 

Szemeredi's ingenious solution to this problem is to extend this sequence of k blocks 
in an additional direction, which gives B (and more importantly, the complement of 
B) enough of an "arithmetic progression" structure that one can eventually get lower 
bounds on the density of A in B. 

To get a preliminary idea of how this works, suppose that we have a moderately
long progression of saturated blocks $P_1, \ldots, P_L$, thus we have $P_i = [a + ir, a + ir + N)$
for some $a \in \mathbb{Z}$ and $r > N$, and

$|A \cap P_i| = (\delta + o_{N \to \infty; A}(1)) N$ for all $1 \leq i \leq L$. (8.2)

Here L is a moderately large number, though it will be smaller than the length N of
each block: $1 \ll L \ll N$. (Given that the set of saturated blocks has upper density
$1 - o_{N \to \infty; A}(1)$, it would be unreasonable to hope to obtain a progression of saturated
blocks of length comparable to N or more.) Let us define $A_i \subset [0, N)$ to be the set
$A \cap P_i$, translated backwards by $a + ir$.

Now let $B \subset [0, N)$ be a set of some size $\alpha N$. Then heuristically we expect $A_i \cap B$
to have size $\approx \delta \alpha N$. Now, as discussed before, any individual $A_i$ need not have any
intersection with B. However, once one considers the sequence $A_1, \ldots, A_L$ there is a
kind of "mixing" phenomenon that forces at least one of the $A_i$ to have at least the
right number of elements inside B:

Lemma 8.1 (Single lower mixing). Let $P_1, \ldots, P_L$ be a progression of saturated blocks,
with attendant sets $A_1, \ldots, A_L \subset [0, N)$ and let $B \subset [0, N)$ be a set of cardinality $\alpha N$.
Then there exists $1 \leq i \leq L$ such that

$|A_i \cap B| \geq (\alpha \delta - o_{L \to \infty; A}(1)) N.$



Proof. By summing (8.2) for $1 \leq i \leq L$ we have

$|A \cap \bigcup_{i=1}^{L} P_i| = (\delta + o_{N \to \infty; A}(1)) NL.$

On the other hand, the set $\bigcup_{i=1}^{L} (P_i \setminus (B + a + ir))$ can be viewed as the union of $(1 - \alpha)N$
arithmetic progressions of length L. Applying (8.1) on each such progression and taking
unions, we obtain

$|A \cap \bigcup_{i=1}^{L} (P_i \setminus (B + a + ir))| \leq (\delta + o_{L \to \infty; A}(1))(1 - \alpha) NL.$

Subtracting the latter estimate from the former, we obtain

$|A \cap \bigcup_{i=1}^{L} (B + a + ir)| \geq (\delta \alpha - o_{L \to \infty; A}(1) - o_{N \to \infty; A}(1)) NL.$

Since $L \ll N$, the latter error term can be absorbed into the former. The claim then
follows from the pigeonhole principle, noting that $A \cap (B + a + ir)$ is just a translate of
$A_i \cap B$. □
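
The two combinatorial facts used in this proof, the decomposition of the complementary set into progressions of length L and the final pigeonhole step, can be checked numerically. The sketch below assumes a randomly generated toy set A and hypothetical helper names (`attendant_set`, `overlaps`, `complement_progressions`); it does not verify the density bounds (8.1) or (8.2), only the bookkeeping.

```python
import random


def attendant_set(A, a, r, N, i):
    """A_i: the part of A in P_i = [a + i*r, a + i*r + N), shifted back to [0, N)."""
    start = a + i * r
    return {x - start for x in A if start <= x < start + N}


def overlaps(A, B, a, r, N, L):
    """The list of |A_i & B| for i = 1, ..., L."""
    return [len(attendant_set(A, a, r, N, i) & B) for i in range(1, L + 1)]


def complement_progressions(B, a, r, N, L):
    """One progression {a + b + i*r : 1 <= i <= L} for each b in [0, N) outside B."""
    return [[a + b + i * r for i in range(1, L + 1)]
            for b in range(N) if b not in B]


random.seed(0)
N, L, a, r = 50, 8, 7, 60          # toy parameters with r > N, so blocks are disjoint
A = {x for x in range(a + r, a + L * r + N) if random.random() < 0.3}
B = set(range(0, N, 2))            # a set of size alpha * N with alpha = 1/2

counts = overlaps(A, B, a, r, N, L)
progs = complement_progressions(B, a, r, N, L)

# Left: the union of P_i minus (B + a + i*r); right: the union of the progressions.
removed = {a + i * r + b for i in range(1, L + 1) for b in range(N) if b not in B}
covered = {x for p in progs for x in p}
```

Here `removed == covered` expresses the decomposition into $(1 - \alpha)N$ progressions, and `max(counts) * L >= sum(counts)` is the pigeonhole step.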

We can amplify this result substantially. Firstly, we may work with multiple sets
$B_1, \ldots, B_m$ instead of a single set B.

Lemma 8.2 (Multiple lower mixing). Let $P_1, \ldots, P_L$ be a progression of saturated blocks,
with attendant sets $A_1, \ldots, A_L \subset [0, N)$ and let $B_1, \ldots, B_m \subset [0, N)$ be sets of cardinality
$\alpha_1 N, \ldots, \alpha_m N$ respectively. Then there exists $1 \leq i \leq L$ such that

$|A_i \cap B_j| \geq (\alpha_j \delta - o_{L \to \infty; A, m}(1)) N$ for all $1 \leq j \leq m$.

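Proof. Suppose that this claim failed. Then for each $1 \leq i \leq L$ there exists a j for
which

$|A_i \cap B_j| < (\alpha_j \delta - o_{L \to \infty; A, m}(1)) N.$

Choosing one such j for each i gives an m-colouring of $\{1, \ldots, L\}$. By van der Waerden's theorem, $\{1, \ldots, L\}$ must
then contain a monochromatic progression of length $\omega_{L \to \infty; m}(1)$, where $\omega_{L \to \infty; m}(1) =
1/o_{L \to \infty; m}(1)$ denotes a quantity which goes to infinity as $L \to \infty$ for any fixed m. The blocks
$P_i$ indexed by this monochromatic progression themselves form a progression of saturated blocks, so
Lemma 8.1 applies to them with B replaced by the corresponding $B_j$. But
this contradicts the displayed inequality if the o() constants are chosen properly. □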

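Corollary 8.3 (Multiple mixing). Let $P_1, \ldots, P_L$ be a progression of saturated blocks,
with attendant sets $A_1, \ldots, A_L \subset [0, N)$ and let $B_1, \ldots, B_m \subset [0, N)$ be sets of cardinality
$\alpha_1 N, \ldots, \alpha_m N$ respectively. Then there exists $1 \leq i \leq L$ such that

$|A_i \cap B_j| = (\alpha_j \delta + o_{L \to \infty; A, m}(1)) N$ for all $1 \leq j \leq m$.

Proof. Apply the preceding lemma, but with m replaced by 2m and with $B_{j+m} :=
[0, N) \setminus B_j$ for $1 \leq j \leq m$. The lower bound for $B_{j+m}$, combined with
$|A_i| = (\delta + o_{N \to \infty; A}(1)) N$ from (8.2), gives the matching upper bound for $B_j$. □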

This type of result is useful when m is small compared with L. Since L is in turn 
small compared to N, this means that we can only hope to exploit this mixing property 
when the number m of sets that we wish to be uniformly distributed with respect to A is 
small compared with the size N of the block. At first glance, this will severely limit the 
usefulness of this mixing property; however, we can use the Szemeredi regularity lemma 
to get around this problem (the key point being that the complexity of the partition
created by the regularity lemma - which will be m - does not depend on the number of
underlying vertices, which is essentially N): 

Proposition 8.4 (Graph mixing). Let $P_1, \ldots, P_L$ be a progression of saturated blocks,
with attendant sets $A_1, \ldots, A_L \subset [0, N)$ and let $G_1, \ldots, G_m \subset [0, N) \times [0, N)$ be bipartite
graphs connecting two copies of [0, N). Then there exists $1 \leq i \leq L$ such that

$\sum_{b \in [0, N)} \bigl| |\{a \in A_i : (a, b) \in G_j\}| - \delta |\{a \in [0, N) : (a, b) \in G_j\}| \bigr| = o_{L \to \infty; A, m}(N^2)$ for all $1 \leq j \leq m$.

This is a remarkably strong assertion that the set $A_i$ becomes uniformly distributed
with density $\delta$ on the interval [0, N) for many values of i. Note that the error term is
completely uniform in the graphs $G_1, \ldots, G_m$ (although it does depend of course on the
number m of graphs involved) and also is independent of N (after normalising out the
natural $1/N^2$ factor).
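
To make the quantity in Proposition 8.4 concrete, here is a direct evaluation of the left-hand side for a single graph. This is a sketch only: the name `graph_discrepancy`, the representation of a graph as a set of pairs (a, b), and the toy inputs are assumptions made for illustration.

```python
def graph_discrepancy(A_i, G, N, delta):
    """Sum over b in [0, N) of
    | #{a in A_i : (a, b) in G} - delta * #{a in [0, N) : (a, b) in G} |."""
    total = 0.0
    for b in range(N):
        neighbours = [a for a in range(N) if (a, b) in G]
        in_A = sum(1 for a in neighbours if a in A_i)
        total += abs(in_A - delta * len(neighbours))
    return total


N, delta = 10, 0.5
complete = {(a, b) for a in range(N) for b in range(N)}
evens = set(range(0, N, 2))

balanced = graph_discrepancy(evens, complete, N, delta)   # A_i has density delta everywhere
lopsided = graph_discrepancy(set(), complete, N, delta)   # A_i misses every neighbourhood
```

The proposition asserts that for a suitable i this sum is small compared with $N^2$, simultaneously for all m graphs; the second toy input shows the opposite extreme, where the sum is of size $\delta N^2$.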

Proof. (Sketch) By van der Waerden's theorem as before we can reduce to the case
m = 1. Pick an $\varepsilon > 0$ and apply the Szemeredi regularity lemma to $G_1$ to obtain
an $\varepsilon$-regular approximation to $G_1$ induced by a partition of complexity $O_\varepsilon(1)$. Apply
Corollary 8.3 to estimate the contribution of the approximation to obtain a net error
of $o_{L \to \infty; A, \varepsilon}(N^2) + o_{\varepsilon \to 0}(N^2)$. The claim then follows by choosing $\varepsilon$ to be a sufficiently
slowly decaying function of L. (One could also proceed here using a weaker regularity
lemma such as Corollary 6.8.) □

Let us now informally discuss how one can exploit such strong mixing properties to
extend progressions of length $k-1$ to progressions of length k. (Actually, for technical
inductive reasons we will also need to extend progressions of length $i-1$ to progressions
of length i for $1 \leq i \leq k$; we shall return to this point later.) Suppose we have a sequence
of k-tuples $(P_{1,i}, P_{2,i}, \ldots, P_{k,i})$ of saturated blocks for $1 \leq i \leq L$, where each k-tuple is
in progression, and furthermore the final blocks $P_{k,1}, \ldots, P_{k,L}$ of each k-tuple are also
in progression. We can then define sets $A_{j,i} \subset [0, N)$ for $1 \leq j \leq k$ and $1 \leq i \leq L$ as
before by intersecting A with $P_{j,i}$ and then translating back to [0, N). We also make
the assumption that A "looks the same" in the non-final blocks $P_{1,i}, \ldots, P_{k-1,i}$, in the
sense that for any $1 \leq j \leq k-1$, the sets $A_{j,i}$ are in fact independent of i. Suppose
also that in each k-tuple $(P_{1,i}, P_{2,i}, \ldots, P_{k,i})$, we have found "many" ($\gg \delta^{k-1} N^2$, in fact)
progressions of length k, with the j-th element of the progression in $P_{j,i}$, and with the first
$k-1$ elements in A. Note that in fact once a single k-tuple, say $(P_{1,1}, P_{2,1}, \ldots, P_{k,1})$, has
this property, then all k-tuples do, since this property depends only on the distribution
of A in the non-final blocks $P_{1,i}, \ldots, P_{k-1,i}$ and we are assuming that this distribution
is independent of i. Later we shall address the rather important question of how one
could construct such a strange sequence of k-tuples; for now, let us simply assume that
such a sequence exists. This sequence shows that A has many progressions of length
$k-1$. We now show that some of these progressions of length $k-1$ can be extended to
progressions of length k in A; this is a model of the key inductive step in Szemeredi's
argument.

Consider the sets $A_{1,i}, \ldots, A_{k-1,i}, A_{k,i}$ in [0, N), which describe the distribution of A
in the k-tuple $(P_{1,i}, \ldots, P_{k,i})$. The first $k-1$ of these sets are independent of i, while
the final set $A_{k,i}$ varies in i; however, because the blocks $P_{k,1}, \ldots, P_{k,L}$ are in progression, the final set $A_{k,i}$
obeys the strong mixing properties described earlier. By hypothesis, we have many
progressions of length k in [0, N), with the j-th element of such progressions lying in
$A_{j,i}$ for $1 \leq j \leq k-1$. The k-th elements of such progressions can be collected into a
subset of [0, N) which we shall call B; we can then get a reasonable lower bound on
the density of B in [0, N) (roughly speaking, we have $|B| \gg \delta^{k-1} N$). The objective
is to get B to intersect $A_{k,i}$ for at least one i, as this will generate a progression of
length k in A. But this happens for at least one i if L is large enough (depending on $\delta$,
but not on N), thanks to Lemma 8.1. (Note that we did not use the strongest mixing
properties available; we will utilise those later.) Indeed the intersection of B with $A_{k,i}$
will be rather large, and by arguing slightly more carefully one can then show that the
i-th k-tuple $(A_{1,i}, \ldots, A_{k,i})$ will contain quite a large number of progressions of length k
($\gg \delta^k N^2$, in fact).

To summarise, by using the mixing properties, we can convert a long sequence of
k-tuples of blocks, each of which contains many progressions of length $k-1$ in A, into
a single k-tuple of blocks, which contains many progressions of length k in A, provided
that we have the following two additional properties:

• The distribution of A in the $k-1$ non-final blocks of the k-tuples is fixed as one
moves along the sequence.

• The final blocks of the k-tuples are in progression as one moves along the sequence.

This looks like a promising induction-type step. However, it cannot by itself be iterated
to generate progressions of length k unconditionally, for two reasons. Firstly, there is
the minor objection that we will need a generalisation of the above statement in which
progressions of length $k-1$ and k in A are replaced by progressions of length $i-1$ and i
in A for various $1 \leq i \leq k$. This is not hard to address. The more important objection
is that we will need a way of generating not only individual k-tuples of blocks that
contain progressions of length (say) $k-1$ in A, but entire sequences of such k-tuples
which obey additional structural properties.

The key to obtaining this type of superstructure atop a k-tuple of blocks in [40] is
to pass to a "coarser" level, and to view each block as a single element of Z; the
saturated blocks (as well as a subset of the saturated blocks which are known as the
"perfect" blocks) then become subsets of Z. These sets in turn have upper densities, and
one can also define notions of saturated blocks of these sets, which are thus "blocks of
blocks". The point is that the task of finding sequences of k-tuples of blocks simplifies,
on moving to this coarser scale, to the task of finding sequences of k-term progressions,
which is easier and in fact will follow once one has a suitable k-tuple of saturated blocks
at this coarse scale.

The details are very technical, but let us just mention some brief highlights here.
Write $A_0 := A$. One picks a large number $N_0$ for which there are lots of saturated
blocks of length $N_0$ (the upper density of such blocks should be $1 - o_{N_0 \to \infty; A}(1)$). We
subdivide the integers into blocks of length $N_0$, and identify the set of such blocks again
with Z, creating a "coarse scale" view of the set $A_0$. (Objects in the coarse scale will
be subscripted by 1, while objects in the fine scale are subscripted by 0.) The saturated
blocks then form a subset $S_1$ of Z of upper density close to 1. Each element of $S_1$
corresponds to a saturated block, with respect to which A is distributed in one of $2^{N_0}$
ways. This can be viewed as a colouring of $S_1$ into $2^{N_0}$ colours. One of the colour classes
must be somewhat prevalent (in particular, occurring with positive upper density); we
designate this as the "perfect" colour, and let $A_1 \subset S_1$ be the associated colour class.
(The precise definition of "prevalent" is slightly technical - it is sort of an upper density
"relative" to $S_1$ - and we omit it here.) $A_1$ has some upper density $\delta_1$; it is possible
(after some notational trickery) to run a density increment argument for $A_1$ and reduce
to the case where $A_1$ obeys an analogue of the bound (8.1). In particular we can pick
a large number $N_1$ (much larger than $N_0$) and construct many saturated blocks of $A_1$
of length $N_1$. The definition of "saturated" is a little technical; we require that these
blocks not only contain $A_1$ to approximately the right density (i.e. $\delta_1 + o_{N_1 \to \infty; A_1}(1)$),
but also contain $S_1$ to approximately the right density ($1 - o_{N_0 \to \infty; A}(1)$, if $N_1$ is large
enough). This can be done by tinkering with the notion of upper density appropriately,
as mentioned briefly before; we omit the details.
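
The passage to the coarse scale can be mimicked on a finite window. The sketch below is a loose illustration under several assumptions that are not in the text: the function name `coarse_scale`, the tolerance parameter `tol` standing in for the $o(1)$ term in the definition of "saturated", and the use of the most frequent pattern as a stand-in for the "prevalent" colour (whose precise definition is omitted above).

```python
from collections import Counter


def coarse_scale(A, N0, num_blocks, delta, tol):
    """Cut [0, num_blocks * N0) into blocks of length N0 and return (S1, A1, perfect):
    S1 = indices of blocks where A has density within tol of delta,
    perfect = the most common pattern of A among those blocks (a frozenset in [0, N0)),
    A1 = indices of blocks in S1 carrying the perfect pattern."""
    A = set(A)
    patterns = {}
    for n in range(num_blocks):
        pattern = frozenset(x - n * N0
                            for x in range(n * N0, (n + 1) * N0) if x in A)
        if abs(len(pattern) - delta * N0) <= tol * N0:   # toy notion of "saturated"
            patterns[n] = pattern
    S1 = set(patterns)
    if not S1:
        return S1, set(), None
    perfect, _ = Counter(patterns.values()).most_common(1)[0]
    A1 = {n for n, p in patterns.items() if p == perfect}
    return S1, A1, perfect


N0, num_blocks = 6, 10
# Toy set: multiples of 3, with the element 0 deleted to spoil the first block.
A = {x for x in range(N0 * num_blocks) if x % 3 == 0} - {0}
S1, A1, perfect = coarse_scale(A, N0, num_blocks, delta=1 / 3, tol=0.01)
```

Each block index plays the role of a single integer at the coarse scale, so `S1` and `A1` are again sets of integers to which the same density considerations can be applied.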

Now suppose one has a k-tuple $P_1, \ldots, P_k$ of saturated blocks of $A_1$, and suppose
that one can find many k-term progressions with the i-th term in $P_i$ for $1 \leq i \leq k$,
and also in $A_1$ for $1 \leq i \leq k-1$. Specifically, let us suppose that for almost all (e.g.
with density $1 - o_{N_0 \to \infty; A}(1)$) of the integers n in the middle third of the final block
$P_k$, there are many k-term progressions ending in n with the first $k-1$ terms
in $P_1 \cap A_1, P_2 \cap A_1, \ldots, P_{k-1} \cap A_1$ respectively. Most of these integers n are going
to also lie in $S_1$ (since $S_1$ fills almost all of $P_k$), and so there should be no difficulty
obtaining an arithmetic progression of such n of some moderate length $L_0$ (which can
be a slowly growing function of $N_0$), thus each element of this progression is the final
element of a k-term progression which is mostly in $A_1$. Now recall that each integer in
this coarse representation corresponds to a block of length $N_0$ in the original fine-scale
representation. Thus this arithmetic progression can be identified with a sequence of $L_0$
k-tuples of such blocks, where the final block in each k-tuple is in arithmetic progression,
and all the other blocks have the "perfect" colour. This is essentially the very structure
we need in order to run our inductive step and convert the progressions with $k-1$
elements in $A_0$, to progressions with k elements in $A_0$.

To summarise, by coarsening the scale it is possible to convert k-tuples of blocks to
sequences of k-tuples of blocks (and more generally to a type of "homogeneous, well-arranged"
family of k-tuples, as defined in [40]). These sequences can then be traded in
via the mixing properties to upgrade short progressions in a set A to longer progressions.
By alternating these two arguments in a moderately sophisticated induction argument
(passing from fine scales to coarse scales approximately $2^k$ times), one can start with
progressions with 0 elements in one of the $A_j$ sets and eventually upgrade to progressions
with k elements in the original set A. There are some technical issues at intermediate
stages of the argument, when descending a scale in a case when only the first i elements of
a progression are guaranteed to have the perfect colour, when it becomes important that
the remaining elements are unsaturated. To achieve this, the graph mixing properties in
Proposition 8.4 become essential; the progressions are reinterpreted as edges connecting
the elements of one block to another. We omit the details.